TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

Agrikk posted:

What are the implications of using a composite primary key versus creating an autoincrementing column for a PK and then creating an index on the columns? Why would one use one method over the other?

For example:

If I have data [colA, colB, colC, colD...] that is unique for [colA,colB] what are the implications of creating a table with a composite primary key of [colA, colB] versus creating an autoincrementing col0 and then creating a secondary index on [colA, colB]?

The short answer is that composite primary keys are essentially never a good idea. For essentially all practical purposes, a composite primary key has nothing but downsides. I'd actually go even further than that and say that natural primary keys in general are pretty much always a bad idea and are best avoided.

The hypothetical upside of a composite primary key is that you save a small amount of data per row by avoiding an extra column and its index. That saving is mostly illusory, though, not least because a composite primary key adds at least one extra column to every row in every other table that has a foreign key referencing it. In postgres the differences basically stop there; as far as indexing and index usage goes there's really no difference between a primary key and any other unique index, AFAIK. However, in certain databases (MSSQL for example), the primary key is typically clustered, meaning that its ordering controls the physical ordering of the table pages on disk, and that the index pages for the clustered index are also the data pages for the entire table. That in turn means finding a row by its primary key saves you one index read compared to looking it up by another index. In most cases this isn't really meaningful though, and clustering by a composite primary key comes with a bunch of caveats of its own. You can also include non-indexed column data in any index (making it a covering index) if you have some particularly critical use case where you really need to save that one extra read.
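As a sketch of that last point: Postgres 11+ (and MSSQL, with similar syntax) lets you tack non-key payload columns onto an index with INCLUDE, so a lookup never has to touch the table itself. Table and column names here are invented for illustration:

SQL code:
```sql
-- Hypothetical table t: queries filter on (col_a, col_b) but also need col_c.
-- INCLUDE stores col_c in the index leaf pages without making it part of the
-- key, so the query below can be satisfied by an index-only scan.
CREATE INDEX idx_t_a_b_covering ON t (col_a, col_b) INCLUDE (col_c);

SELECT col_c FROM t WHERE col_a = 1 AND col_b = 2;
```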

More generally, in the chaotic environment of human systems natural primary keys have an unsettling tendency to stop being natural primary keys. Primary keys must be guaranteed to be stable and immutable, but people tend to want to change data in databases at some point, so years down the line someone usually comes along and changes the requirements and now the previously natural primary key is an enormous pain you need to migrate away from. The general principle here is that identifiers should not carry information. I've hurfed and durfed about this ITT before so I'll stop there.

TheFluff fucked around with this message at 20:19 on Jun 10, 2022


Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

TheFluff posted:

The general principle here is that identifiers should not carry information.

I think this is the easiest point to remember.

So as a rule of thumb, good table design should use some kind of row identifier, like an autoincrementing integer, as the primary key, and use secondary indexes for actual performance considerations.

Is that correct?

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

Agrikk posted:

I think this is the easiest point to remember.

So as a rule of thumb, good table design should use some kind of row identifier, like an autoincrementing integer, as the primary key, and use secondary indexes for actual performance considerations.

Is that correct?

Yep, you got it, and don't forget that adding unique indexes just for data integrity reasons is usually a pretty good idea too. They can of course be multi-column too if you need it.
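For concreteness, a minimal sketch of that pattern in Postgres (the table and column names are made up):

SQL code:
```sql
-- Surrogate primary key plus a multi-column unique constraint that enforces
-- the natural key purely for data integrity; foreign keys reference id.
CREATE TABLE thing (
    id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    col_a  int NOT NULL,
    col_b  int NOT NULL,
    col_c  text,
    CONSTRAINT uq_thing_a_b UNIQUE (col_a, col_b)
);
```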

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
To reduce my own durfhurfing, I'll merely note that composite property keys are to be avoided as reasoned above. Composite ID keys, however, may well be necessary and appropriate.

EG, (course INT, student INT) is preferred over an arbitrary course_student_id. (These are the internal integers, not the ID printed on the student's grade report.) imho.

(I'm a strong proponent of an integer ID field (called 'table.id', not 'table.table_id'), but use a database that always puts IDs on rows in the pages, so even if a composite ends up in a weird state it can be repaired and the extra ID field is just a waste during joins.)
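A sketch of what that looks like, assuming internal integer IDs on both sides (table names invented):

SQL code:
```sql
-- Pure association table between two surrogate-keyed tables. The composite
-- key is built from meaningless internal IDs, so it carries no information
-- and dodges the natural-key objections above.
CREATE TABLE course_student (
    course  int NOT NULL REFERENCES course (id),
    student int NOT NULL REFERENCES student (id),
    PRIMARY KEY (course, student)
);
```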

abelwingnut
Dec 23, 2002


anyone migrated on-prem databases to aws rds? got a potential client who wants to migrate their 37 oracle databases to aws.

i've never dealt with oracle before or migrated a database to aws. on top of that, they want this done by the end of the year. given all of this, i'm not sure if this is something i want to involve myself with. i feel like there's a lot of uncertainty given their setup and my experience. i've only really messed with sql server.

i'd have some help with some of their people, but...still kind of wary. maybe i'm just being too cautious and this is fairly easy?

pays really well, so i've got to ask.

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Agrikk posted:

What are the implications of using a composite primary key versus creating an autoincrementing column for a PK and then creating an index on the columns? Why would one use one method over the other?

For example:

If I have data [colA, colB, colC, colD...] that is unique for [colA,colB] what are the implications of creating a table with a composite primary key of [colA, colB] versus creating an autoincrementing col0 and then creating a secondary index on [colA, colB]?

My philosophy is always use a surrogate key, for various reasons.

Primary keys should never change in a database, and that's just as true of composite primary keys. You think you'll never want to change that business key, but eventually you'll run into some weird use-case that requires it. At that point you'll have to go through and update all the foreign-key references to that business key, etc.; you've effectively denormalized your data across a ton of tables. Can you do it? Yes, but a surrogate key doesn't really cost you anything worse than a slight performance/memory/size hit from having another index.

They're also loving annoying to work with on every level... if the key has three columns, then every table that references it has to have three columns of its own devoted to just that single many-to-one reference. They're a pain in an ORM too, because you have to set up these MyClassId(varA,varB,varC) identifiers everywhere you want to look something up; nothing wrong with that per se, but it's a lot more boilerplate to write and to interact with on an ongoing basis. And I don't know how you'd serialize them to (eg) JSON... three fields on the object? a little identifier object like "myId : {varA: x, varB: y, varC: z}"?

Having done them both ways, composite primary keys are a thumbs down from me, they're a big pain all around. Absolutely surrogate + unique index for business.

Agrikk posted:

I think this is the easiest point to remember.

So as a rule of thumb, good table design should use some kind of row identifier, like an autoincrementing integer, as the primary key, and use secondary indexes for actual performance considerations.

Is that correct?

Sequence generators work mostly fine if you obey the rules and don't do stupid poo poo. The biggest rule is "do not assign semantic meaning to the sequence value". To make sequences perform well, you realistically need per-connection caching (each connection grabs N sequence values at once when it increments the sequence, so you don't hit disk every time). Because of this caching the values are not chronological: a record with a higher sequence value may have been created earlier. You shall not care about gaps or holes (every time you restart the client or drop a connection, the counter has already been incremented to reserve values for that connection, and it never rolls back), or about how fast you burn through it (64b is a long-rear end time). If you need a value with semantic meaning that you don't want to have gaps ("item number X"), then that is not your primary key, and you keep a "gap-less" generator in a table or similar. If you need a chronological ordering, use a timestamp column. And don't write ID allocators into your default constructor because you're too lazy to debug which code flows aren't assigning a primary key and "the value will just get overwritten during the load if it's a stored object" (lol).
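The per-connection caching described above can be made explicit in Postgres; the CACHE value below is an arbitrary example:

SQL code:
```sql
-- Each session pre-allocates 50 values at a time. Values handed out to
-- different sessions interleave non-chronologically, and any cached values
-- left unused when a session disconnects simply become gaps.
CREATE SEQUENCE order_id_seq CACHE 50;

SELECT nextval('order_id_seq');
```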

(ohhhh boy that was a bad one, I originally patched it (cause lolwut spin the DB lock dozens of times per request?) when we ported the application's ORM layer and someone rolled back the patch because hey, these flows aren't populating their uids and it's blowing up when they go to save, and it never got fixed until it basically took down the prod database (oracle) server from resource contention going for the locks. I did warn them a couple times about that one too, but we had caching, so it wasn't that awful.. until it was.)

UUIDs have some very nifty properties for the way software is written today and basically aren't any more painful than having an id number. You can put a default UUID generator on a column in Postgres if you want, so there it behaves exactly like any other DB-generated value. The key UUID property is that two computers can just randomly generate numbers forever and will essentially never hit a collision, because the space is so big - meaning clients (your application, or other microservices) can generate their own UUIDs without having to touch the database or even the owning service. You just assign a UUID and send it, because it's new every time - that's what UUIDs are - and the database will (again, assuming a proper implementation of the generator) never have seen that UUID before. So you never have to ask the database to assign you sequence values.
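For example, in Postgres 13+ gen_random_uuid() is built in (older versions get it from the pgcrypto extension), so a column can default to a server-generated UUID while clients remain free to supply their own. Table and column names here are invented:

SQL code:
```sql
CREATE TABLE event (
    id      uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    payload jsonb
);

-- A client that already generated its own UUID just inserts it directly,
-- never asking the database for an identifier:
INSERT INTO event (id, payload)
VALUES ('a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11', '{}');
```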

An RDBMS is still a performance-limited system; it's hard to scale well past single-machine designs and clustering just isn't good, so spending cycles inside the DB spinning an RNG or waiting on locks is wasteful. Being able to push that CPU load out of the DB (by having clients generate UUIDs instead of asking the DB for sequence values) onto scalable commodity systems is a performance win, and there's not really a big downside besides larger identifiers (e.g. 128b vs a 64b long).

Paul MaudDib fucked around with this message at 06:50 on Jun 15, 2022

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003
uuids are actually universally unique. it says so on the tin. not even in another galaxy could your id possibly be repeated

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

DELETE CASCADE posted:

uuids are actually universally unique. it says so on the tin. not even in another galaxy could your id possibly be repeated

oops, all the universes were initialized with seed(0)

NihilCredo
Jun 6, 2011

iram omni possibili modo preme:
plus una illa te diffamabit, quam multæ virtutes commendabunt

Just found out that azure's built-in pgbouncer listens on port 6432 instead of acting as a middleman on port 5432. Three months of mysterious random timeouts solved :suicide:

Just-In-Timeberlake
Aug 18, 2003

abelwingnut posted:

anyone migrated on-prem databases to aws rds? got a potential client who wants to migrate their 37 oracle databases to aws.

i've never dealt with oracle before or migrated a database to aws. on top of that, they want this done by the end of the year. given all of this, i'm not sure if this is something i want to involve myself with. i feel like there's a lot of uncertainty given their setup and my experience. i've only really messed with sql server.

i'd have some help with some of their people, but...still kind of wary. maybe i'm just being too cautious and this is fairly easy?

pays really well, so i've got to ask.

I migrated our MSSQL databases from on-prem to RDS; it was pretty straightforward*. For MSSQL you just make a backup (if the DBs are large, break it up), upload it to S3, and then do a restore. For MySQL/Aurora I think there's an easier way: if I remember correctly you just click the "Restore from S3" button on the RDS dashboard. The longest part was uploading a ~terabyte of database backups, so if the DBs are large make sure you have a fast pipe.

Other than that do your research to make sure RDS is compatible with everything they need. I don't know Oracle, but in MSSQL for example, CLR functions are only supported for MSSQL 2016 and below, and there are some things disabled for security purposes no matter what the version (making calls to an API for example).

*migration was simple in the end, but I did a good amount of prep work setting up the VPC and security for the databases and then testing the process of migrating, connectivity, etc. This was a year ago, they probably have a CloudFormation template or some easier method of doing the VPC/security stuff now.

Just-In-Timeberlake fucked around with this message at 13:01 on Jun 15, 2022

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

abelwingnut posted:

anyone migrated on-prem databases to aws rds? got a potential client who wants to migrate their 37 oracle databases to aws.

i've never dealt with oracle before or migrated a database to aws. on top of that, they want this done by the end of the year. given all of this, i'm not sure if this is something i want to involve myself with. i feel like there's a lot of uncertainty given their setup and my experience. i've only really messed with sql server.

i'd have some help with some of their people, but...still kind of wary. maybe i'm just being too cautious and this is fairly easy?

pays really well, so i've got to ask.

It's very straightforward (posting from memory):

1. On a workstation with access to both source and target database machines install the AWS Schema Conversion Tool.
2. Launch the tool and configure source endpoint (on-prem oracle DB host) and target endpoint (AWS RDS Oracle, RDS Postgres, Aurora Postgres).
3. Select the database/schema/tables/functions/views on the source DB and create a conversion job to add these objects to the destination DB.
4. Run the job (takes a few minutes, depending on the complexity of the source database).

5. Launch the AWS console and go to Database Migration Service
6. Configure your endpoints (specifying IP address and database engine) for source and target
7. create a replication instance. (This is the instance that does the actual work of copying data from source to target. The instance size will determine how long it will take to migrate the data.)
8. Create the migration task.

This is the heavy lifting. You'll specify both endpoints and your replication instance, as well as create rules for what you want to migrate, using wildcards and masks to select schemas/tables/etc. This is the part that takes a few tries to get right.

9. Run the task.

Once the task is active it will begin replicating data, starting with a full data load and then keeping the data in sync from source to target on an ongoing basis for as long as you keep the job running, which lets you verify the target data for as long as you need before cutting over.

10. Swing the workload from on-prem to AWS RDS while at the same time stopping the replication task.

abelwingnut
Dec 23, 2002


that's all super helpful--thanks.

yea, i thought they wanted to convert their oracle db to standard aws rds, but not even. what they clarified earlier today is they simply want to move their oracle db to rds oracle. as far as i can tell, i don't think that would require the conversion tool, though that is a pretty cool tool from aws.

since learning that, info i've been looking at is here: https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-an-on-premises-oracle-database-to-amazon-rds-for-oracle.html

seems pretty straightforward.

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.
Postgres timestamp question:

I have a table:
code:
CREATE TABLE schema.table1 (
	host varchar(6) NOT NULL,
	target varchar(6) NOT NULL,
	stamp timestamp NOT NULL,
	duration numeric(5, 1) NULL,
	CONSTRAINT pk_day_host_target_stamp PRIMARY KEY (host, target, stamp)
);
(I know, I know: per my post above, a composite pk is suboptimal...)

It is populated by grabbing the elapsed download time of a file from a target to a host every ten minutes and recording the result to a table. It contains the following:

code:
host	target		stamp			duration
lt01u	NRT20-C3	2022-06-15 12:10:00.710	130.92
lt01u	NRT20-C3	2022-06-15 12:00:01.301	130.93
lt01u	NRT20-C3	2022-06-15 11:50:00.703	130.92
lt01u	NRT20-C3	2022-06-15 11:40:00.837	130.92
lt01u	NRT20-C3	2022-06-15 11:30:00.749	130.92
lt01u	NRT20-C3	2022-06-15 11:20:00.713	130.92
lt01u	NRT20-C3	2022-06-15 11:10:00.701	130.93
lt01u	NRT20-C3	2022-06-15 11:00:01.202	130.92
lt01u	NRT20-C3	2022-06-15 10:50:00.694	130.93
lt01u	NRT20-C3	2022-06-15 10:42:40.389	130.93
I am trying to select the bottom row with
code:
SELECT * FROM download_times.download_times_minute WHERE stamp='2022-06-15 10:42:40.389';
but it returns no rows. I'm literally copy/pasting the stamp value from a SELECT * statement so I'd expect a return of

code:
lt01u	NRT20-C3	2022-06-15 10:42:40.389	130.93
Why is it not returning anything? Is it some weird formatting issue or a timestamp/timezone thing?

Agrikk fucked around with this message at 20:27 on Jun 15, 2022

abelwingnut
Dec 23, 2002


not sure what the exact function is in postgres, but what happens when you use datepart() for each element of the stamp?

i've had similar trouble with sql server, and just resorted to always using datepart. always works.

nielsm
Jun 1, 2009



https://www.postgresql.org/docs/current/datatype-datetime.html says the timestamp type has 1 microsecond resolution, but your textual representation only has 1 millisecond resolution. The stored values probably have more precision than gets displayed, so they don't produce an exact match with the copy-pasted textual value.

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.
It turns out that the value of seconds was actually 40.389993 so yep, that was it.

So now the question is either: how do I reveal the actual stored value in a select statement?

or

How do I store the value at less precision?

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

Agrikk posted:

It turns out that the value of seconds was actually 40.389993 so yep, that was it.

So now the question is either: how do I reveal the actual stored value in a select statement?

SQL code:
select to_char(stamp, 'YYYY-MM-DD HH24:MI:SS.US') as stamp_with_microsec_precision
See https://www.postgresql.org/docs/current/functions-formatting.html

Alternately, truncate the stored value to millisecond precision:

SQL code:
select *
from table1
where
    date_trunc('millisecond', stamp) = timestamp '2022-06-15 10:42:40.389';
but this might have weird rounding gotchas.
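Another option, for the "store at less precision" side of your question: assuming losing the microseconds is acceptable, you can declare the column with limited precision so Postgres rounds values on input. Table name per your CREATE TABLE above:

SQL code:
```sql
-- timestamp(3) keeps millisecond precision; existing values are rounded once
-- during the type change, and future inserts are rounded automatically.
ALTER TABLE table1 ALTER COLUMN stamp TYPE timestamp(3);
```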

I gotta say though that selecting by exact timestamp is... sort of a weird thing to do, I think?

TheFluff fucked around with this message at 23:47 on Jun 15, 2022

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
I'm curious what you're actually trying to do. "Get the bottom row" is not exactly a "data query", but of course it can be done with MIN or window functions if that's actually the intended logical operation. If you want "the earliest of the day", that's MIN.

Now then, on to the more harsh question. What's the time source for the data? If it's atomic or Unix time it (should be) monotonic, but if it's system time then it can flop backwards during DST changes. :clint: Since you didn't use timestamp with timezone, I'm a tad worried.


If your situation is "I just screwed up a single row after using the database for 9mo and just need to repair it", well, there's an answer for that that works without flailing around with formatting, but sheesh, I haven't needed it in so long... :ohdear: I had forgotten the old way was removed. (Seriously, you shouldn't use this except when you've accidentally created an exact duplicate row, but if you truly, really, absolutely must: look up ctid.)

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

TheFluff posted:

I gotta say though that selecting by exact timestamp is... sort of a weird thing to do, I think?

PhantomOfTheCopier posted:

If your situation is "I just screwed up a single row after using the database for 9mo and just need to repair it", well there's an answer for that that works without flailing around with formatting, but sheesh I haven't needed it in so long... :ohdear:


This.

The batch job runs every ten minutes to collect the data, and I occasionally run the job manually which collects data at odd times. I've found a couple of timestamps that I'd like to delete and was struggling with the WHERE statement. Now that I know it's the microsecond vs millisecond issue identified upthread I can grab the appropriate entries using DATE_PART and then delete them.
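For instance, one of those deletes can match on second precision so the stored microseconds don't matter (a sketch; worth double-checking the rows with a SELECT using the same WHERE clause first):

SQL code:
```sql
DELETE FROM download_times.download_times_minute
WHERE date_trunc('second', stamp) = timestamp '2022-06-15 10:42:40';
```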


PhantomOfTheCopier posted:

I'm curious what you're actually trying to do. "Get the bottom row" is not exactly a "data query", but of course it can be done with MIN or window functions if that's actually the intended logical operation. If you want "the earliest of the day", that's MIN.

My example had the specific line of data in the bottom row so I referred to it as such. In reality it's about a half dozen lines of non-standard times scattered throughout a few thousand records.

Also: The time source is NOW() run on the server which is set to UTC. I'm not using timezones because I shove all time entries as UTC and just say "all times in UTC" and call it good. :)

prom candy
Dec 16, 2005

Only I may dance
I have a really stupid question, hopefully this is the right place.

Let's say I have three tables: Users, CarMakes, and UserCarMakes. CarMake just has values like "Honda" and "Toyota" and UserCarMake is just a many-to-many join table between User and CarMake.

How do I write a query to return all users that have a Honda AND a Toyota? I know I can do it pretty easily for an OR:

code:
SELECT * FROM `Users` 
  LEFT JOIN `UserCarMakes` ON `UserCarMakes`.`UserId` = `Users`.`id` 
  LEFT JOIN `CarMakes` ON `CarMakes`.`id` = `UserCarMakes`.`CarMakeId`
WHERE `CarMakes`.`value` IN ('Honda', 'Toyota')
But I don't know how to do it for an AND

code:
SELECT * FROM `Users` 
  LEFT JOIN `UserCarMakes` ON `UserCarMakes`.`UserId` = `Users`.`id` 
  LEFT JOIN `CarMakes` ON `CarMakes`.`id` = `UserCarMakes`.`CarMakeId`
WHERE `CarMakes`.`value` = 'Honda' AND `CarMakes`.`value` = 'Toyota'
This obviously returns no rows because it's searching for a single CarMake that has two values at once. One thought I had was doing a separate join for each requested type:

code:
SELECT * FROM `Users` 
  LEFT JOIN `UserCarMakes` ON `UserCarMakes`.`UserId` = `Users`.`id` 
  LEFT JOIN `CarMakes` AS `Hondas` ON `Hondas`.`id` = `UserCarMakes`.`CarMakeId` AND `Hondas`.`value` = 'Honda'
  LEFT JOIN `CarMakes` AS `Toyotas` ON `Toyotas`.`id` = `UserCarMakes`.`CarMakeId` AND `Toyotas`.`value` = 'Toyota'
WHERE `Hondas`.`id` IS NOT NULL AND `Toyotas`.`id` IS NOT NULL
But this feels wrong. Is there a right way to do this sort of query?

Just-In-Timeberlake
Aug 18, 2003

prom candy posted:


code:

SELECT * FROM `Users` 
  LEFT JOIN `UserCarMakes` ON `UserCarMakes`.`UserId` = `Users`.`id` 
  LEFT JOIN `CarMakes` AS `Hondas` ON `Hondas`.`id` = `UserCarMakes`.`CarMakeId` AND `Hondas`.`value` = 'Honda'
  LEFT JOIN `CarMakes` AS `Toyotas` ON `Toyotas`.`id` = `UserCarMakes`.`CarMakeId` AND `Toyotas`.`value` = 'Toyota'
WHERE `Hondas`.`id` IS NOT NULL AND `Toyotas`.`id` IS NOT NULL

It's fine. You could just use an INNER JOIN and you won't have to check for NULL

code:

SELECT 	* 
FROM 	`Users` 
        INNER JOIN `UserCarMakes` ON `UserCarMakes`.`UserId` = `Users`.`id` 
        INNER JOIN `CarMakes` AS `Hondas` ON `Hondas`.`id` = `UserCarMakes`.`CarMakeId` 
             AND `Hondas`.`value` = 'Honda'
        INNER JOIN `CarMakes` AS `Toyotas` ON `Toyotas`.`id` = `UserCarMakes`.`CarMakeId` 
             AND `Toyotas`.`value` = 'Toyota'

Just-In-Timeberlake fucked around with this message at 15:37 on Jun 17, 2022

prom candy
Dec 16, 2005

Only I may dance

Just-In-Timeberlake posted:

It's fine. You could just use an INNER JOIN and you won't have to check for NULL

code:

SELECT 	* 
FROM 	`Users` 
        INNER JOIN `UserCarMakes` ON `UserCarMakes`.`UserId` = `Users`.`id` 
        INNER JOIN `CarMakes` AS `Hondas` ON `Hondas`.`id` = `UserCarMakes`.`CarMakeId` 
             AND `Hondas`.`value` = 'Honda'
        INNER JOIN `CarMakes` AS `Toyotas` ON `Toyotas`.`id` = `UserCarMakes`.`CarMakeId` 
             AND `Toyotas`.`value` = 'Toyota'

Thanks, but it seems like this doesn't actually work. It does work if I INNER JOIN the UserCarMakes table twice as well, though it seems like it shouldn't need that.

Just-In-Timeberlake
Aug 18, 2003

prom candy posted:

Thanks, but it seems like this doesn't actually work. It does work if I INNER JOIN the UserCarMakes table twice as well, though it seems like it shouldn't need that.

ah yeah, that's my mistake, you do need that second UserCarMakes table join.

prom candy
Dec 16, 2005

Only I may dance

Just-In-Timeberlake posted:

ah yeah, that's my mistake, you do need that second UserCarMakes table join.

Okay, thanks for the help! Also I'll probably be back here on Monday when I can't figure out the best way to find users who DON'T have a Honda.

nielsm
Jun 1, 2009



prom candy posted:

How do I write a query to return all users that have a Honda AND a Toyota?

Use an EXISTS in the WHERE clause instead of joining the tables twice. That's also easier to chain when you need to check for three or four makes.

prom candy
Dec 16, 2005

Only I may dance

nielsm posted:

Use an EXISTS in the WHERE clause instead of joining the tables twice. That's also easier to chain when you need to check for three or four makes.

I don't think I've ever used an EXISTS before (can you tell I hardly ever step out of my ORM?) Do I still join or do I just do it all in a subquery in the WHERE clause?

nielsm
Jun 1, 2009



Phone posting here so can't write a full example. But join the tables you need to select data from, not those you only need to check conditions from.

select * from foo where exists(select 1 from bar join quux on bar.z = quux.z where foo.x = bar.x and quux.v = 'fart')

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

prom candy posted:

I don't think I've ever used an EXISTS before (can you tell I hardly ever step out of my ORM?) Do I still join or do I just do it all in a subquery in the WHERE clause?

can't really dig into your specifics right now, but fyi if you've never used EXISTS then you should definitely consider using "EXISTS (SELECT 1 from subquery)" whenever you reach for the "IN (subquery)" tool in your toolbox... IN is basically optimized for small static values and EXISTS works better with large selects on dynamic data. You would think the planner would be able to optimize them to the same query plan but it doesn't.

Generally, a good ORM will support EXISTS in its own language/criteria-builder as well, so you aren't forced to dig down to native SQL if you don't want.

https://www.mssqltips.com/sqlservertip/6659/sql-exists-vs-in-vs-join-performance-comparison/

https://www.javatpoint.com/in-vs-exists

Lonely Wolf
Jan 20, 2003

Will hawk false idols for heaps and heaps of dough.
fyi select 1 can thwart some query optimizers, which special-case select * in semi/anti-joins to not actually fetch anything

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003
exists() for semi-joins is cool and good and better than in(), which sometimes fucks up your query plan. but where it really shines is the use of not exists() for anti-joins, which is especially cooler and gooder and betterer than not in(), which practically always fucks up your query plan

prom candy
Dec 16, 2005

Only I may dance
Cool I'll dig into this on Monday and see if I can rewrite that query with EXISTS. The anti-join thing is gonna be key as well because i just know that's the next feature request. Thanks all!

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
I don't like EXISTS and NOT EXISTS. I prefer to use a join, because it's less complicated.

RandomBlue
Dec 30, 2012

hay guys!


Biscuit Hider

Hammerite posted:

I don't like EXISTS and NOT EXISTS. I prefer to use a join, because it's less complicated.

They don't serve the same purpose. Sure, you can make a join work like exists, but it's not the same: the performance isn't the same, and if the table you're joining to has multiple matching rows while you're not selecting from that table, you'll have to DISTINCT the output of the query.

e: assuming you weren't joking

abelwingnut
Dec 23, 2002


exists also handles NULL fields better.
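The classic illustration of that, with hypothetical tables a and b: if the subquery result contains a NULL, NOT IN matches nothing at all, because x NOT IN (1, NULL) evaluates to NULL rather than true under three-valued logic:

SQL code:
```sql
-- Returns no rows whatsoever if any b.y IS NULL, even for a.x values
-- that are clearly absent from b:
SELECT * FROM a WHERE a.x NOT IN (SELECT b.y FROM b);

-- The NOT EXISTS version behaves as expected regardless of NULLs:
SELECT * FROM a WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.y = a.x);
```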

RandomBlue
Dec 30, 2012

hay guys!


Biscuit Hider

abelwingnut posted:

exists also handles NULL fields better.

loving NULL

Data Graham
Dec 28, 2009

📈📊🍪😋



Thread title

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

RandomBlue posted:

They don't serve the same purpose. Sure, you can use join to work like exists but it's not the same and the performance isn't the same and if the table you're joining to has multiple results but you're not selecting from that table you'll have to distinct the output of the query.

e: assuming you weren't joking

That's true, I think I had in mind a table with a uniqueness constraint on the relevant columns.

prom candy
Dec 16, 2005

Only I may dance
Okay so here's my original query with the multiple joins

code:
SELECT * FROM `User`
 INNER JOIN `UserCarMake` AS `VolkswagenJoin` ON `VolkswagenJoin`.`userId` = `User`.`id`
 INNER JOIN `CarMake` AS `Volkswagen` ON `Volkswagen`.`id` = `VolkswagenJoin`.`carMakeId` AND `Volkswagen`.`name` = 'Volkswagen'
 INNER JOIN `UserCarMake` AS `HondaJoin` ON `HondaJoin`.`userId` = `User`.`id`
 INNER JOIN `CarMake` AS `Honda` ON `Honda`.`id` = `HondaJoin`.`carMakeId` AND `Honda`.`name` = 'Honda';
 
and here's my query with EXISTS

code:
SELECT * FROM `User` 
  WHERE 
    EXISTS (
      SELECT 1 FROM `CarMake` 
        INNER JOIN `UserCarMake` ON `UserCarMake`.`carMakeId` = `CarMake`.`id` 
        WHERE `CarMake`.`name` = 'Volkswagen' AND `UserCarMake`.`userId` = `User`.`id`
    )
  AND EXISTS (
    SELECT 1 FROM `CarMake` 
        INNER JOIN `UserCarMake` ON `UserCarMake`.`carMakeId` = `CarMake`.`id` 
        WHERE `CarMake`.`name` = 'Honda' AND `UserCarMake`.`userId` = `User`.`id`
  )
According to SQL Fiddle the first query takes 18ms and the second one takes 1ms, and they both return the correct data. Is this how you would write this query?

I also wrote this query for finding users that don't own a Kia and it seems to work:

code:
SELECT * FROM `User`
  WHERE 
    NOT EXISTS (
      SELECT 1 FROM `CarMake` 
        INNER JOIN `UserCarMake` ON `UserCarMake`.`carMakeId` = `CarMake`.`id` 
        WHERE `CarMake`.`name` = 'Kia' AND `UserCarMake`.`userId` = `User`.`id`
    );
Here's the fiddle if anyone feels like taking a look at it http://sqlfiddle.com/#!9/e2d3d5/9

Withnail
Feb 11, 2004
Is there a way in sql server to extract the data in a node of XML that is stored in an ntext column and expose it as a column in a view?

Edit. I guess you can

, r.XmlCol.value('(/parentnode//somenode//nodeiwantdatatodisplay//node()[1])', 'varchar(max)')
from table
outer apply (select cast(ntextcol as xml) as realxml) s
outer apply s.realxml.nodes('parentnode') as r(XmlCol)

Withnail fucked around with this message at 21:19 on Jun 20, 2022


prom candy
Dec 16, 2005

Only I may dance
This exists stuff helped me add a whole bunch of functionality to our app today, thanks again for all the help. I found a rails gem that makes it really easy to integrate into active record queries too.
