joojoo2915
Jun 3, 2006

quote:

code:

select cashid
from (
-- guarantee one row per elevator/date combination
select cashid
from the_table
group by cashid, tdate
)t -- derived table syntax requires an alias
group by cashid
having count(*) > 100 -- only pull elevators with more than 100 days worth of data

Thank you very much! Worked perfectly.


hexadecimal
Nov 23, 2008

by Fragmaster
How does the running time of MySQL queries compare to retrieving information from a text file? That is, how much more or less expensive is it to run something like
SELECT * FROM table_name versus reading all that information from a file without processing it?

Victor
Jun 18, 2004
File I/O costs will most likely dwarf CPU costs. Network I/O costs might not need to be considered if you're on a 1Gbit LAN. If the table is badly fragmented, then reading from a file would probably be faster.

hexadecimal
Nov 23, 2008

by Fragmaster

Victor posted:

File I/O costs will most likely dwarf CPU costs. Network I/O costs might not need to be considered if you're on a 1Gbit LAN. If the table is badly fragmented, then reading from a file would probably be faster.

Doesn't MySQL need to read from disk as well, though? Isn't it also affected by File I/O, or does it try to hold as much information in memory as possible?

Victor
Jun 18, 2004
Well, it clearly depends on the table size compared to memory availability, how well MySQL manages its buffers, how often the table is accessed, etc. My point in my previous post was that while MySQL might be slightly slower, I/O costs dwarf CPU costs, so the difference would be small. Network I/O costs and fragmentation of the table are really the biggest factors. If the table happens to be entirely cached in memory, then you have no I/O costs and you just have to see how network I/O costs factor in.

Pento McGreno
May 4, 2004

Cheese.

hexadecimal posted:

Doesn't MySQL need to read from disk as well, though? Isn't it also affected by File I/O, or does it try to hold as much information in memory as possible?

The first time (provided you use a similar disk read method), the flat file will be faster; there's just more overhead with MySQL. Subsequent times, MySQL will be faster, assuming you've set up your buffers correctly.
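A minimal sketch of the comparison, using Python's sqlite3 as a runnable stand-in for MySQL (the table name and data are invented):

```python
import csv
import os
import sqlite3
import tempfile

# Toy data set: same rows go into a flat file and a database table.
rows = [(i, f"name{i}") for i in range(1000)]

# Write the flat file.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Load the same data into a table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# Read everything back both ways, without any processing.
with open(path, newline="") as f:
    file_rows = [(int(i), name) for i, name in csv.reader(f)]
db_rows = conn.execute("SELECT id, name FROM t ORDER BY id").fetchall()

# Same data either way; which read is faster depends on buffering,
# caching, and fragmentation, as the posts above discuss.
assert file_rows == db_rows
```

Timing either path is left out deliberately: the result would say more about the OS page cache than about the engines.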

hexadecimal
Nov 23, 2008

by Fragmaster

Pento McGreno posted:

The first time (provided you use a similar disk read method), the flat file will be faster; there's just more overhead with MySQL. Subsequent times, MySQL will be faster, assuming you've set up your buffers correctly.

Can you elaborate more on setting up the buffers?

tayl0r
Oct 3, 2002
This post brought to you by SOE
You don't use a database because it's fast; you use a database because it gives you the best interface for inserting, updating, and selecting data (with locking, rollbacks, and all the other features that relational databases give you).

If your app somehow just needs 1 huge text file to work, and you're not going to have to insert / update / delete data from the text file, then why use a database?

nbv4
Aug 21, 2002

by Duchess Gummybuns
In my database, I have a table called "flights" which has information about the flight, as well as a plane_id, which is connected to a row in a "planes" table which has information about the plane. I have a user who has like 900 planes in his database which are basically "orphan" planes. I want to remove these duplicate planes, which do not have a single flight connected to them. How do I do this? I came up with a query which pretty much does what I want using LEFT JOINs, discarding all entries whose flight_id is not NULL, but this query for some stupid reason takes over 40 seconds to run on normal user_ids, and for this particular user I guess it takes too long, because it always returns a "1053" error. What's the best way to do this?

edit: here is the query that is taking too long to run:

code:
SELECT * 
FROM planes 
LEFT JOIN flights 
ON flights.plane_id = planes.plane_id
WHERE planes.pilot_id = 137 AND flights.flight_id is NULL
This code works on just about every other user_id I've tried, but for this one particular user who has like 900 planes, it takes so long that it returns an error.

nbv4 fucked around with this message at 20:04 on Dec 15, 2008

Operation Atlas
Dec 17, 2003

Bliss Can Be Bought

nbv4 posted:

In my database, I have a table called "flights" which has information about the flight, as well as a plane_id, which is connected to a row in a "planes" table which has information about the plane. I have a user who has like 900 planes in his database which are basically "orphan" planes. I want to remove these duplicate planes, which do not have a single flight connected to them. How do I do this? I came up with a query which pretty much does what I want using LEFT JOINs, discarding all entries whose flight_id is not NULL, but this query for some stupid reason takes over 40 seconds to run on normal user_ids, and for this particular user I guess it takes too long, because it always returns a "1053" error. What's the best way to do this?

edit: here is the query that is taking too long to run:

code:
SELECT * 
FROM planes 
LEFT JOIN flights 
ON flights.plane_id = planes.plane_id
WHERE planes.pilot_id = 137 AND flights.flight_id is NULL
This code works on just about every other user_id I've tried, but for this one particular user who has like 900 planes, it takes so long that it returns an error.

What are your indexes like? Make sure you have an index on pilot_id, flight_id, and plane_id (on both tables!)

How big is the flights table? For it to be taking this long, you'd have to have both no indexes AND a table with more than 1,000,000 rows.

Personally, I'd do:

code:
select * from planes where planes.pilot_id = 137 and (select count(*) from flights where plane_id=planes.plane_id) = 0
Which is not guaranteed to be faster, but might be, depending on the cardinality of the flights.plane_id and planes.pilot_id indexes.
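Both formulations should return the same orphan planes. A runnable sketch using sqlite3 (schema and data invented to match the thread's column names):

```python
import sqlite3

# Toy data modeled on the thread's schema; values are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE planes (plane_id INTEGER PRIMARY KEY, pilot_id INTEGER);
CREATE TABLE flights (flight_id INTEGER PRIMARY KEY, plane_id INTEGER);
INSERT INTO planes VALUES (1, 137), (2, 137), (3, 137), (4, 42);
INSERT INTO flights VALUES (10, 1), (11, 1), (12, 3);
""")

# Anti-join: pilot 137's planes that have no matching flight.
left_join = conn.execute("""
    SELECT planes.plane_id
    FROM planes
    LEFT JOIN flights ON flights.plane_id = planes.plane_id
    WHERE planes.pilot_id = 137 AND flights.flight_id IS NULL
""").fetchall()

# Correlated-count version, as in the post above.
corr = conn.execute("""
    SELECT plane_id FROM planes
    WHERE pilot_id = 137
      AND (SELECT COUNT(*) FROM flights
           WHERE plane_id = planes.plane_id) = 0
""").fetchall()

assert left_join == corr == [(2,)]   # plane 2 is the only orphan
```

Either way, without indexes on flights.plane_id and planes.pilot_id both shapes degrade to repeated scans, which is the 40-second behavior described above.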

nbv4
Aug 21, 2002

by Duchess Gummybuns
^^ actually I'm stupid. There were no indexes... I added the indexes and now it runs in under a second.

Triple Tech
Jul 28, 2006

So what, are you quitting to join Homo Explosion?
What database system is this that something like that takes over 40 seconds?

nbv4
Aug 21, 2002

by Duchess Gummybuns

Triple Tech posted:

What database system is this that something like that takes over 40 seconds?

MySQL 5 with about 14,000 rows total and no indexes lollers.

Triple Tech
Jul 28, 2006

So what, are you quitting to join Homo Explosion?
I guess I'm still very new to databases, and I'm not exactly sure how powerful our hardware is (and it might well be, since THE MANAGERS just like to throw money at the problem instead of making things The Right Way (tm)). But in my limited experience, anything that takes over a couple of seconds is suspect of being approached incorrectly, and I work with tables that are at least in the hundreds-of-thousands-of-rows range. Tens of seconds is just... crazy to me. Chalk one up for Microsoft and SQL Server, I guess.

The longest part of the process I'm working on is selecting (almost) an entire table of data (like four million rows) and "translating" it to another format for insertion in a different table. That takes like... 13 minutes?

No Safe Word
Feb 26, 2005

Triple Tech posted:

I guess I'm still very new to databases, and I'm not exactly sure how powerful our hardware is (and it might well be, since THE MANAGERS just like to throw money at the problem instead of making things The Right Way (tm)). But in my limited experience, anything that takes over a couple of seconds is suspect of being approached incorrectly, and I work with tables that are at least in the hundreds-of-thousands-of-rows range. Tens of seconds is just... crazy to me. Chalk one up for Microsoft and SQL Server, I guess.

The longest part of the process I'm working on is selecting (almost) an entire table of data (like four million rows) and "translating" it to another format for insertion in a different table. That takes like... 13 minutes?

I've had some stuff that was pretty well-optimized and still took over half a day to run, but it was Type 2 slowly-changing dimension stuff on a multi-billion row database, so it's entirely understandable for it to take quite some time. :v:

Triple Tech
Jul 28, 2006

So what, are you quitting to join Homo Explosion?
What is type 2? Kind of difficult to Google for, I'm not sure what I'm looking at.

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Triple Tech posted:

What is type 2? Kind of difficult to Google for, I'm not sure what I'm looking at.
SCD Type I is discard history, SCD Type II is retain history, SCD Type III is retain last history as attribute.
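The Type 2 mechanics can be sketched in a few lines: history is retained by closing out the old row and inserting a new one with fresh effective dates. A hypothetical supplier dimension (all names and dates invented), using sqlite3 so it runs standalone:

```python
import sqlite3

# Hypothetical Type 2 slowly-changing dimension table.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE supplier_dim (
    supplier_key  INTEGER PRIMARY KEY,   -- surrogate key
    supplier_id   TEXT,                  -- natural key
    state         TEXT,
    eff_from      TEXT,
    eff_to        TEXT                   -- NULL marks the current row
)
""")
conn.execute(
    "INSERT INTO supplier_dim VALUES (1, 'ABC', 'CA', '2007-01-01', NULL)")

# Supplier ABC moves to IL: close the old time slice, open a new one.
conn.execute("""
    UPDATE supplier_dim SET eff_to = '2008-12-15'
    WHERE supplier_id = 'ABC' AND eff_to IS NULL
""")
conn.execute(
    "INSERT INTO supplier_dim VALUES (2, 'ABC', 'IL', '2008-12-15', NULL)")

history = conn.execute(
    "SELECT supplier_key, state, eff_to FROM supplier_dim ORDER BY supplier_key"
).fetchall()
assert history == [(1, 'CA', '2008-12-15'), (2, 'IL', None)]
```

Every fact row bound to surrogate key 1 stays pinned to the CA era, which is exactly why retroactive changes get expensive at billion-row scale.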

No Safe Word
Feb 26, 2005

Triple Tech posted:

What is type 2? Kind of difficult to Google for, I'm not sure what I'm looking at.

wikipedia posted:

http://en.wikipedia.org/wiki/Slowly_Changing_Dimension#Type_2

Transactions that reference this Surrogate Key (Supplier_Key) are then permanently bound to these time slices defined by each row in the slowly changing dimension table. If there are retrospective changes made to the contents of the dimension, or if a new set of attributes are added to the dimension (for example a Sales Rep column) which have different effective dates to those already defined, then this can result in the existing transactions needing to be updated to reflect the new situation. This can be an expensive database operation, so Type 2 SCD are not a good choice if the dimensional model or data is subject to change.

:v:

Unfortunately sometimes it's appropriate and you just have a buttload of data to process.

Aredna
Mar 17, 2007
Nap Ghost
What are some good books or other resources to learn about SCD and other types of data modeling?

When I had trouble reporting on when people moved from one supervisor to another or changed job functions I re-invented Type 3 without knowing what it was until today. Unfortunately I'm just a report bitch and never got my ticket pushed through to make it happen and have since moved on to another department that doesn't have the same problems.

tayl0r
Oct 3, 2002
This post brought to you by SOE

Aredna posted:

What are some good books or other resources to learn about SCD and other types of data modeling?

When I had trouble reporting on when people moved from one supervisor to another or changed job functions I re-invented Type 3 without knowing what it was until today. Unfortunately I'm just a report bitch and never got my ticket pushed through to make it happen and have since moved on to another department that doesn't have the same problems.

"The Data Warehouse Toolkit" is pretty much the bible of dimensional data warehousing, but it is pretty long. I don't recommend it unless you actually want to have a career in the field.

Mega Shark
Oct 4, 2004
How can I insert large amounts of binary data via SQL without using load_file?

MySQL 5.0
C++

tayl0r
Oct 3, 2002
This post brought to you by SOE

Mega Shark posted:

How can I insert large amounts of binary data via SQL without using load_file?

MySQL 5.0
C++

You could try using a tool like SyncSort (http://www.syncsort.com/products/dmx/home.htm, not free, you can't even download a trial), but I'm not sure it would be any faster than just writing your own import script which does the inserts line by line.

In Oracle you could use SQL*Loader (sqlldr), which basically bypasses the normal Oracle interface for inserts and instead writes the data straight to the table. It also works with binary data.
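For the original question, the usual alternative to load_file() is a parameterized INSERT with the bytes bound as a BLOB parameter, so nothing has to be escaped into the SQL text. A sketch using sqlite3 so it runs standalone (a MySQL connector's parameter binding works the same way in spirit; in C++ against MySQL 5.0 the analogue would be a prepared statement with a BLOB-typed bind):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blobs (id INTEGER PRIMARY KEY, payload BLOB)")

# Arbitrary binary data; the ? placeholder binds it as a BLOB,
# no quoting or load_file() needed.
payload = bytes(range(256)) * 4
conn.execute("INSERT INTO blobs (payload) VALUES (?)", (payload,))

(stored,) = conn.execute("SELECT payload FROM blobs WHERE id = 1").fetchone()
assert stored == payload   # round-trips byte-for-byte
```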

Pento McGreno
May 4, 2004

Cheese.

hexadecimal posted:

Can you elaborate more on setting up the buffers?

Use InnoDB, and set innodb_buffer_pool_size to something big (size of data + indexes + 10%). Data and indexes will be stored here in memory, rather than on disk.
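In config terms, that advice might look like the following my.cnf sketch (the 2G figure is a placeholder; size the pool to roughly data + indexes + 10%, as suggested above):

```ini
# my.cnf sketch -- values are placeholders, not recommendations
[mysqld]
default-storage-engine = InnoDB
innodb_buffer_pool_size = 2G
```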

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb
If there are two tables (mysql):

product
-id
name

productvote
-productId
-timestamp

I'm trying to get a list of products ordered by the time it was last voted on.

SELECT product.id, product.name FROM productvote JOIN product ON product.id = productvote.productId ORDER BY productvote.timestamp DESC;

That seemed to work, but it can show duplicate products. Doing a SELECT DISTINCT instead didn't seem to work like I thought it would. Why not?

Or should I do it like this? How do I maintain the order of the productIds inside the IN() in the result of the parent query?

SELECT product.id, product.name FROM product WHERE product.id IN (SELECT DISTINCT productvote.productId FROM productvote ORDER BY productvote.timestamp DESC);

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

fletcher posted:

If there are two tables (mysql):

product
-id
name

productvote
-productId
-timestamp

I'm trying to get a list of products ordered by the time it was last voted on.

SELECT product.id, product.name FROM productvote JOIN product ON product.id = productvote.productId ORDER BY productvote.timestamp DESC;

That seemed to work, but it can show duplicate products. Doing a SELECT DISTINCT instead didn't seem to work like I thought it did, why not?

Or should I do it like this? How do I maintain the order of the productIds inside the IN() in the result of the parent query?

SELECT product.id, product.name FROM product WHERE product.id IN (SELECT DISTINCT productvote.productId FROM productvote ORDER BY productvote.timestamp DESC);

given that it's mysql, you probably want something like
code:
SELECT product.id,
       product.name 
  FROM (select productId
              ,max(timestamp) as timestamp
          from productvote
         group by productId) as pvote JOIN
       product ON product.id = pvote.productId ORDER BY pvote.timestamp DESC;
Though I'm not sure about the aliasing syntax, and it's possible your version of mysql may not allow subqueries.
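A runnable sketch of the derived-table approach (note the subquery needs a GROUP BY productId to yield one row per product), with invented data, via sqlite3:

```python
import sqlite3

# Collapse productvote to one row per product (latest vote), then join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE productvote (productId INTEGER, timestamp TEXT);
INSERT INTO product VALUES (1, 'widget'), (2, 'gadget'), (3, 'gizmo');
INSERT INTO productvote VALUES
    (1, '2008-12-01'), (2, '2008-12-10'), (1, '2008-12-15');
""")

rows = conn.execute("""
    SELECT product.id, product.name
    FROM (SELECT productId, MAX(timestamp) AS ts
          FROM productvote
          GROUP BY productId) AS pvote
    JOIN product ON product.id = pvote.productId
    ORDER BY pvote.ts DESC
""").fetchall()

# gizmo has no votes, so it drops out; widget's latest vote is newest.
assert rows == [(1, 'widget'), (2, 'gadget')]
```

Each product appears once, ordered by its most recent vote, which is the duplicate-free result the question asked for.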

Operation Atlas
Dec 17, 2003

Bliss Can Be Bought

fletcher posted:

If there are two tables (mysql):

product
-id
name

productvote
-productId
-timestamp

I'm trying to get a list of products ordered by the time it was last voted on.

SELECT product.id, product.name FROM productvote JOIN product ON product.id = productvote.productId ORDER BY productvote.timestamp DESC;

That seemed to work, but it can show duplicate products. Doing a SELECT DISTINCT instead didn't seem to work like I thought it did, why not?

Or should I do it like this? How do I maintain the order of the productIds inside the IN() in the result of the parent query?

SELECT product.id, product.name FROM product WHERE product.id IN (SELECT DISTINCT productvote.productId FROM productvote ORDER BY productvote.timestamp DESC);


Personally I'd do:

code:
select * from product order by (select max(timestamp) from productvote where product.id=productId) DESC
But this may be slow depending on the size of your product table. (<1000 and you're probably fine)

tayl0r
Oct 3, 2002
This post brought to you by SOE

Operation Atlas posted:

Personally I'd do:

code:
select * from product order by (select max(timestamp) from productvote where product.id=productId) DESC
But this may be slow depending on the size of your product table. (<1000 and you're probably fine)

As someone who has written a poo poo load of SQL, your solution offends me.

I think 99% of the time, this is a horrible way to do it. Every row in the product table is going to spawn 1 select from the productvote table. It's going to take hundreds of times longer to complete. Sure, it's fine if your data set is small, but even so, you shouldn't encourage doing things the wrong way.

You get the exact same result just by creating a subquery of productId and max(timestamp) and joining that with your product table. This is how Jethro solved the problem. It's a 1-to-1 relationship and it should be super fast, and it properly scales with data set size.

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb
Thanks for all the responses guys.

Jethro posted:

given that it's mysql, you probably want something like
code:
SELECT product.id,
       product.name 
  FROM (select productId
              ,max(timestamp) as timestamp
          from productvote
         group by productId) as pvote JOIN
       product ON product.id = pvote.productId ORDER BY pvote.timestamp DESC;
Though I'm not sure about the aliasing syntax, and it's possible your version of mysql may not allow subqueries.

This query seems really slow (~2 sec for a pretty small data set)

What indexes would help me with a query like this?

code:
mysql> EXPLAIN SELECT product.id, product.text FROM (SELECT productId, 
MAX(timestamp) AS timestamp FROM productvote GROUP BY productId) AS pvote JOIN 
product ON product.id = pvote.productId ORDER BY pvote.timestamp DESC LIMIT 0,25;
+----+-------------+-------------+------+-----------------+------+---------+-----------------+--------+---------------------------------+
| id | select_type | table       | type | possible_keys   | key  | key_len | ref             | rows   | Extra                           |
+----+-------------+-------------+------+-----------------+------+---------+-----------------+--------+---------------------------------+
|  1 | PRIMARY     | <derived2>  | ALL  | NULL            | NULL | NULL    | NULL            | 152254 | Using filesort                  |
|  1 | PRIMARY     | product     | ref  | PRIMARY,id_2,id | id   | 4       | pvote.productId |      1 | Using where                     |
|  2 | DERIVED     | productvote | ALL  | NULL            | NULL | NULL    | NULL            | 958686 | Using temporary; Using filesort |
+----+-------------+-------------+------+-----------------+------+---------+-----------------+--------+---------------------------------+

Operation Atlas
Dec 17, 2003

Bliss Can Be Bought

tayl0r posted:

As someone who has written a poo poo load of SQL, your solution offends me.

I think 99% of the time, this is a horrible way to do it. Every row in the product table is going to spawn 1 select from the productvote table. It's going to take hundreds of times longer to complete. Sure, it's fine if your data set is small, but even so, you shouldn't encourage doing things the wrong way.

You get the exact same result just by creating a subquery of productId and max(timestamp) and joining that with your product table. This is how Jethro solved the problem. It's a 1-to-1 relationship and it should be super fast, and it properly scales with data set size.

I said it would be slow.

I was going for something that was easy to understand. It doesn't help the person if you give them a query and they don't understand what makes it work. ORDER BY subqueries are conceptually very easy to understand (for me at least), while joining on a subquery might not be so much.

If there wasn't already a fast solution posted I would have given one.

Operation Atlas
Dec 17, 2003

Bliss Can Be Bought

fletcher posted:

Thanks for all the responses guys.


This query seems really slow (~2 sec for a pretty small data set)

What indexes would help me with a query like this?

code:
mysql> EXPLAIN SELECT product.id, product.text FROM (SELECT productId, 
MAX(timestamp) AS timestamp FROM productvote GROUP BY productId) AS pvote JOIN 
product ON product.id = pvote.productId ORDER BY pvote.timestamp DESC LIMIT 0,25;
+----+-------------+-------------+------+-----------------+------+---------+-----------------+--------+---------------------------------+
| id | select_type | table       | type | possible_keys   | key  | key_len | ref             | rows   | Extra                           |
+----+-------------+-------------+------+-----------------+------+---------+-----------------+--------+---------------------------------+
|  1 | PRIMARY     | <derived2>  | ALL  | NULL            | NULL | NULL    | NULL            | 152254 | Using filesort                  |
|  1 | PRIMARY     | product     | ref  | PRIMARY,id_2,id | id   | 4       | pvote.productId |      1 | Using where                     |
|  2 | DERIVED     | productvote | ALL  | NULL            | NULL | NULL    | NULL            | 958686 | Using temporary; Using filesort |
+----+-------------+-------------+------+-----------------+------+---------+-----------------+--------+---------------------------------+

Make sure you have an index on productId.

Also: Is it just me or do MySQL explain plans suck rear end compared to Oracle/Postgres ones? That poo poo is impossible to decode.

Operation Atlas fucked around with this message at 04:31 on Dec 18, 2008

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb
Yep, there is.

edit: ^ yeah that's why I asked, I don't know how to make sense of it either.

fletcher fucked around with this message at 04:44 on Dec 18, 2008

tayl0r
Oct 3, 2002
This post brought to you by SOE
One of the most basic concepts of writing SQL, that I think a lot of people don't realize, is that any result set can be used just like a table.

You can select from it, join it with other tables or result sets, etc.

The problem in this case is that you have a one-to-many relationship between the product and productVote tables. You want to get 1 row back per product. So, when you join those two tables together you need to somehow get the productVote table down to only 1 row per product.

The proper way to do that, in this case, is with the max() aggregation function, grouping by productID.

That gives you back 1 row per product, with the most recent date that it was voted on.

Then, join that result set with the product table and you are good to go.

Once you can understand SQL concepts like this, you can start writing some pretty awesome queries.


This is how Jethro solved the problem originally.

tayl0r
Oct 3, 2002
This post brought to you by SOE
You basically want to have an index on anything that is in your where clause. The thing to keep in mind is that if you're going to be getting back a majority of the rows in the table, the optimizer probably thinks it's faster to just do a full table scan and not use any indexes at all.

In this case, since all products probably have a rating, your database is probably not using any indexes since it is accessing the majority of the records in the product table and in the productVote table. So, it's probably doing 2 full table scans.

You might be able to build an index on the product table on productID and productName; then it might realize it can get all the data it needs from the index and not have to hit the table at all.

One thing that I think is useful sometimes is just to build a shitload of indexes. Build indexes on every individual column, build indexes on multiple columns together, etc.
Make sure you do an analyze or a gather stats or whatever you need to do to make your database regenerate its index and table stats.

Then, look at the explain plan for your query and see if it is using any of those indexes. Maybe try to give it some hints if it's not performing the way you expect.

Then, drop the indexes that it isn't using.
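That build-check-drop loop can be sketched end to end. The thread is about MySQL's EXPLAIN; sqlite3's EXPLAIN QUERY PLAN is used here only so the example runs standalone, and the table mirrors the planes example from earlier with invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE planes (plane_id INTEGER, pilot_id INTEGER)")
conn.executemany("INSERT INTO planes VALUES (?, ?)",
                 [(i, i % 50) for i in range(1000)])

def plan(sql):
    # Concatenate the 'detail' column of each plan row into one string.
    return " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM planes WHERE pilot_id = 7"
before = plan(query)                 # no index: full table scan

conn.execute("CREATE INDEX idx_pilot ON planes (pilot_id)")
conn.execute("ANALYZE")              # refresh the optimizer's statistics
after = plan(query)                  # now a search using idx_pilot

assert "SCAN" in before.upper()
assert "idx_pilot" in after
```

If the plan still ignores the index after a stats refresh, that's the signal to drop it, per the advice above.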

Victor
Jun 18, 2004
First, let me reiterate that MySQL sucks rear end. I kind of wonder whether people tend not to know the better ways to formulate SQL (like tayl0r mentioned) because of this. I am starting to wonder how much MySQL stunts people, kind of like BASIC (VB, not as much) supposedly stunts people.

tayl0r, it's freaking awesome to completely understand the relational nature of DB work, but do understand that it can be useful to have an equivalent (logic-wise) subquery version that you can point to and say, "these do the same thing, but this one is better and here is why". Don't require such a fricken abrupt transition between traditional imperative thought and relational-functional thought.

And an aside, I find this syntax easier to read; does anyone agree with me?

code:
select  product.id,
        product.name 
from   (
    select  productId,
            timestamp = max(timestamp)
    from    productvote
    group by productId
) pvote 
join product on product.id = pvote.productId 
order by pvote.timestamp desc

tayl0r
Oct 3, 2002
This post brought to you by SOE
Good point Victor. I should realize that it does take quite a mind-shift to think relationally instead of procedurally.

Another thing I've noticed is that there are a ton of programming aids, frameworks, libraries, and tools out there solely devoted to preventing you from having to write SQL.
It's really easy to write applications with complicated database backends without ever writing a single line of SQL yourself.


I don't like that syntax. I prefer to keep each query tabbed over the same amount, like this:
code:
select p.id,
p.name 
from
(
    select productId,
    max(timestamp) timestamp
    from productvote
    group by productId
) pvote,
product p
where p.id = pvote.productId 
order by pvote.timestamp desc

Victor
Jun 18, 2004
Perhaps this might be good for the original post: The Database Programmer. It's a blog on database fundamentals, written by a guy who seems to understand databases well. I didn't have any issues with his first few entries (see the table of contents), but I haven't vetted all of them. No derived tables yet or in the plans, unfortunately. :-(

IMHO, we need more web-accessible, free content on doing a better job interacting with databases. From what I've seen, The Database Programmer blog shows promise.

Pento McGreno
May 4, 2004

Cheese.

Victor posted:

And an aside, I find this syntax easier to read; does anyone agree with me?
Aye, similar tabbing, with a few differences.

code:
SELECT product.id
      ,product.name 
FROM  (
      SELECT productId
            ,timestamp = max(timestamp)
      FROM   productvote
      GROUP BY productId
      ) pvote 
JOIN product
     ON product.id = pvote.productId 
ORDER BY pvote.timestamp DESC
Comma at the start takes a bit of getting used to, but I find it looks better for large lists. Capitalize SQL keywords to provide a nice visual indicator of where each section begins and ends.

Triple Tech
Jul 28, 2006

So what, are you quitting to join Homo Explosion?

Victor posted:

Perhaps this might be good for the original post: The Database Programmer. It's a blog written by a guy who seems to understand databases well, on database fundamentals. I didn't have any issues with his first few entries (table of index), but I haven't vetted all of them. No derived tables yet or in the plans, unfortunately. :-(

This site looks awesome so far.

Also, capitalizing keywords isn't really needed with good syntax highlighting.

MoNsTeR
Jun 29, 2002

code:
select p.id as PRODUCT_ID
     , p.name as PRODUCT_NAME
  from (
        select productId
             , max(timestamp) as timestamp
          from productvote
         group by productId
       ) pv
     , product p
 where pv.productId = p.id
 order by pv.timestamp desc
1. leading commas > trailing commas
2. SQL/86 joins > SQL/99 joins, also write your joins left-to-right as you go down your table list
3. align commas and the last character of standard keywords
4. all lower case (except top-level query column aliases for clarity)
5. indent subqueries
6. always use the "as" keyword
7. use short table aliases, no more than 4 characters

I frequently write and maintain 300+ line queries and strictly adhering to this system (plus more rules specific to case, decode, etc.) is what keeps me sane.


tayl0r
Oct 3, 2002
This post brought to you by SOE
Do you guys use any tools like PL/SQL Developer, Oracle SQL Developer, Visual Studio, etc?

PL/SQL Developer is by far the best tool that I've ever used for writing SQL and working with the database. It's specifically a PL/SQL development IDE but it also has a ton of things just for browsing database objects, looking at explain plans, writing SQL, importing & exporting data from tables, comparing table DDL in two different schemas or databases, etc.

It's a shame it only works with Oracle.
