minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

jwnin posted:

As a result, our development team thinks that the best thing to do is to split each customer out into its own database. The catch is that there are around 20,000 customers per year, and each customer database would be about 100 MB in size.

Can anyone provide me with some more ammo as to why we would not want to go down this path?
This is a horrible idea. If the performance is slow, then you need to first look at ensuring your indexes and filesystems are set up correctly. Once that's out of the equation, you might simply need bigger iron. Only as a last resort would I consider physically splitting up the databases, and even then I wouldn't assume 1 customer per database, I'd put as many as possible into a single database.

The more disparate databases you have, the more work you'll have to do when it comes to setup, backup and restore, and ensuring schema changes are propagated to each database. And it's utter hell to do any sort of report that requires data from multiple databases. Depending on the complexity, a cross-database report would have to extract the necessary info from each database, insert it into a temporary database, and then run the report. That's very slow.

I speak from experience: our company had a database built around our product, and then management surprised us by announcing a new product. We didn't have time to rejig the database to support multiple products, so we simply duplicated the database schema and kept them separate, hoping that some day we'd be able to merge the two. It's caused us a load of grief and time and I wish we'd kept it all in one.


minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

Argue posted:

I'm using Oracle. What's the best way to do something similar to WHERE name IN (a set of values), except instead of checking for equality, we use the LIKE keyword? It would end up being functionally the same as WHERE name LIKE val1 OR name LIKE val2 OR name LIKE val3, except the number of right hand values can change. Currently, what I'm doing is pre-preparing the SQL by iterating over the list of possible names (which are not retrieved from the DB) and appending "OR name LIKE ?" each time.
Perhaps you can build a single regular expression instead, and then it'll only take 1 statement? Oracle 10g and later have REGEXP_LIKE, e.g.:
code:
SELECT * FROM foo WHERE REGEXP_LIKE(bar, 'foo|bar|baz');
Since the regex only has to be parsed once, that might be faster. Plus, it's more flexible than LIKE.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
Imagine that you want to enter the user information into an Excel spreadsheet, like so:
code:
user_id    login_name   real_name    date_of_birth
100        bob          Bob Smith    1990-11-12
101        john         John Jones   1989-07-03
102        jane         Jane Grant   1994-08-19
A single spreadsheet would be the same as a table (let's call this one "users"). The columns of the spreadsheet are the fields, and each spreadsheet row is a database row.

So if you want to look up the information for user "bob", then you might use the SQL:
code:
SELECT user_id, login_name, real_name, date_of_birth
FROM users
WHERE login_name = 'bob'
Edit: To elaborate, a table consists of a list of zero or more rows. Each row represents all the information about some "thing". A row consists of various fields that describe that single thing. So examples of tables would be "users", "projects", "products", etc. Each row in those tables describes a single user, project or product. The fields depend on the thing you're trying to describe. A user might have an ID, name, DOB, etc. A product might have an ID, name, quantity available, etc.
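
If you wanted to actually create that "users" table in SQL, it would look something like this (just a sketch; the exact types vary a bit between databases):
code:
CREATE TABLE users (
    user_id       INTEGER PRIMARY KEY,  -- unique ID for each user
    login_name    VARCHAR(50),          -- short login name, e.g. "bob"
    real_name     VARCHAR(100),         -- full name, e.g. "Bob Smith"
    date_of_birth DATE
);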

minato fucked around with this message at 10:17 on Nov 12, 2007

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

nbv4 posted:

One more question. Is there a guide somewhere that helps you determine which data type is the best for certain kinds of data? For instance, what's the best type to use to store a user's email address? What if you have a field that only has a small number of possible values? Such as, for example, car manufacturer. Let's say there are only 5 possible manufacturers. Would it be best to store that data as "1,2,3,4,5", then have each number correlate to a manufacturer, or would it be best to just store it as a string? Or what about setting up 5 different boolean columns, one for each manufacturer? Would it matter which way you did it?
In general, the field type you use should be similar to the type you use in your application. So for an email address, you'd use a string.

To store a car manufacturer, there are lots of ways you could do it, and there are pros and cons to each. If you choose to use a string and store it directly (e.g. "Chrysler"), then it's easy to just browse the car rows and see which car corresponds with which manufacturer. However most people will use integers because it's more space efficient, faster for the database to process, less prone to spelling/case errors, and less effort to update if the name ever changes (say, due to a typo).

Most people would set it up like this:

code:
table : "manufacturers"
manufacturer_id  name
1                Ferrari
2                Lamborghini
3                Bugatti

table: "cars"
car_id manufacturer_id name
1      3               Veyron
2      2               Countach
3      3               Type 57FC
You'd then add a "referential integrity" check between cars.manufacturer_id and manufacturers.manufacturer_id. That means that you can't insert a row into the cars table unless the manufacturer_id exists somewhere within the manufacturers table. This helps prevent data integrity errors.
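
The check itself is declared when you create the cars table. It looks something like this (a sketch; the syntax differs slightly between databases):
code:
CREATE TABLE cars (
    car_id          INTEGER PRIMARY KEY,
    manufacturer_id INTEGER NOT NULL
        REFERENCES manufacturers (manufacturer_id),  -- the referential integrity check
    name            VARCHAR(100)
);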

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
Not closing the </form>?

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
You need to put the parameters into quotes. Try something like this:
php:
<?
$sql = sprintf(
       "SELECT *
        FROM pilots
        WHERE email='%s'
          AND password='%s'",
        mysql_real_escape_string($email),
        mysql_real_escape_string(md5($password))
        );
$result = mysql_query($sql);
?>

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

Opius posted:

However, instead of just finding the items with the most 'yay' votes, I'd like to find the ones with the highest ratio of 'yay' votes to 'nay' votes.

Is there a particular way to assign ENUM values to 'fake' columns, so I could do '(yay/nay) AS ratio'? Or is there a better way to achieve the same result?
If you were storing "nay" votes as NULL and "yay" votes as any non-null value, then you could simply do:
code:
SELECT content_id, count(vote)::float/count(*)::float AS ratio
FROM votes
GROUP BY content_id
ORDER BY ratio DESC
This works because count() will only count non-null values if you specify a column.

Edit: Jethro's solution is better; replace my count(vote) with "sum(case when vote = 'yay' then 1 else 0 end)".
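
Putting that together, the whole query would be something like this (untested, and it assumes the vote column holds the strings 'yay' and 'nay'):
code:
SELECT content_id,
       SUM(CASE WHEN vote = 'yay' THEN 1 ELSE 0 END)::float
           / COUNT(*)::float AS ratio
FROM votes
GROUP BY content_id
ORDER BY ratio DESC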

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
That's what you get for choosing the Entity-Attribute-Value model, which many people consider to be a design anti-pattern because of its many drawbacks. You're either going to have to do a JOIN per column, or a sub-select like you're doing now.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

fletcher posted:

So does that mean the views column shouldn't be updated every time a unique user views the article?

I guess this also means that I don't really understand what an indexed column is. Is there an article that will help me out?
An indexed column is simply one that has an index attached to it. An index is a data structure internal to the database that's automatically read and updated by the database (and not by you) to help the database quickly locate rows when you search that column. The advantage is better performance when you're searching that column, and it's especially useful when there are thousands of rows involved. It can bring report times down from minutes to mere seconds. The downside is that there's a performance overhead involved in updating the index whenever you insert, update or delete values in that column.
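
Creating an index is a one-liner. For your case it'd be something like this (a sketch; I'm assuming your table is called "articles" with an "article_id" column):
code:
CREATE INDEX articles_views_idx ON articles (views);

-- a top-100 report like this can then use the index
-- instead of scanning the whole table:
SELECT article_id, views
FROM articles
ORDER BY views DESC
LIMIT 100;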

So there's a trade-off. If your "views" column is being updated very very frequently (like, you're slashdot or something), then adding an index to that column will slow down your DB a little bit, but the upside is that your report to get the top 100 articles sorted by view will be very quick. You'd have to have a pretty heavy-traffic site to notice any slowdown though.

In the event that you did get unacceptable slowdown, the solution would be to sacrifice accuracy for speed. Either:
- Have two "views" columns, one indexed and one unindexed. Keep the unindexed one up-to-date and accurate, and periodically (via a cronjob) copy over this value to the indexed column.
- Have one unindexed "views" column, but cache the results of the top 100 articles so that you don't see the performance hit every time.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

Dakha posted:

This is probably really easy to fix, but I'm trying to select a list of material numbers and their corresponding text descriptions. However, the table with the text descriptions has several entries for each material number, for different languages. I want to get the German description, unless it doesn't exist, in which case I want the English description.

code:
SELECT materials.material_number
    , (SELECT material_text
       FROM mat_desc
       WHERE mat_desc.material_number = materials.material_number
       ORDER BY (CASE WHEN mat_desc.language = 'DE' THEN '00000'
                      ELSE mat_desc.language
                 END)
       LIMIT 1
       )
FROM materials
ORDER BY material_number;
This will select all materials, and for each material do a sub-select on the language. The CASE statement will order it so that the DE language comes first (assuming "00000" is considered to be before any existing language name).

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
I have a general SQL table design question.

Let's say I have a table that stores "application events". The table needs to refer to the source of the event, and these sources may be of various different types, from various unrelated sub-systems of the application. E.g.

- Forum thread X was updated
- File Y was uploaded
- User Z was added

I'm trying to work out the best way of designing a table to store the reference to the source, and using a FK constraint as well (so that deleting the source of the event would also delete the event).

Option 1) 3rd-normal form ahoy!
-----------------------------
code:
table: "event"
event_id | event_time

table: forum_event
event_id (FK on event.event_id) | forum_thread_id 

table: file_event
event_id (FK on event.event_id) | file_id

table: user_event
event_id (FK on event.event_id) | user_id
Four tables. One holds the "common" information about the event, and the others hold the "specific" information about that event.

Pros:
+More space efficient (to an extent)
+More correct database normalization
Cons:
-Suckier to work with (requires JOINs, more complex collation code)
-Small performance hit from the FK constraints on event.event_id
-A new table is required to add a new type.


Option 2) Use a single table:
-----------------------------
code:
table: "event"
event_id | event_time | forum_thread_id | file_id | user_id
Each row contains an event_id/event_time, and exactly ONE of "forum_thread_id", "file_id", and "user_id" will ever be non-NULL.

Pros:
+Simple to understand
+Means only searching one table, less schema management
+A new column is all that's required to add a new type.
Cons:
-Should implement a row "CHECK" to ensure that exactly 1 of the event-specific fields was chosen (sketched below).
-A little more code required to identify which type of event this is.
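
(For reference, that CHECK would be something like this in Postgres - a sketch only:)
code:
ALTER TABLE event ADD CONSTRAINT exactly_one_source CHECK (
      (forum_thread_id IS NOT NULL)::int
    + (file_id         IS NOT NULL)::int
    + (user_id         IS NOT NULL)::int = 1
);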


Option 3) Store a "type" and ID
-----------------------------
code:
table: "event"
event_id | event_time | type | id
Here, "type" would be a string like "forum", "file", or "user". "id" would be the ID of that forum, file or user.

Pros:
+Succinct and simple
Cons:
-Not possible to use FK constraints on "id".
-Alarmingly like the "Entity-Attribute-Value" anti-pattern.



In the past I've implemented Option 1, like a good little database normalizer. However the clunkiness of doing so has led me to want to implement Option 2. The rest of my team (not very DB-design-savvy) is advocating Option 3, which to me is right out because it would take god himself to pry FK constraints from my dead hands.

Has anyone had any experience on what is the best approach?

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
The events system was actually just an example to get the general principle across; I'm not actually writing an events system. In my particular case, the FKs will not introduce significant overhead.

I was more concerned about what people thought of breaking 3rd-normal form in order to simplify the matter of referring to various different types, and whether people had actually used the EAV design anti-pattern I mentioned in Option 3 and regretted it. Thanks for the comments so far!

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

m5 posted:

Example: if you have a column in a table that references another table, and you put a FK constraint on that, you'd better put an index on it too - even if you have no reason to do so. Why? Because if you don't, when you delete rows from the target table the server will have no choice but to do a table scan of the other table to make sure you're not violating the constraint.
Sure, but 99% of the time the FK points to the PK of the target table, and PKs should always have an index on them to ensure uniqueness.

m5 posted:

I believe that FK constraints are a good idea when appropriate and when you can afford them, but to blindly throw them into your schema whenever you have a cross-table reference is not wise.
I see them as just a good database design principle. They ensure my data integrity, they can help prevent silly coding errors, they can improve performance by ensuring I don't have crufty tables, and they help describe the relationship between two tables.

However like all principles, they sometimes need to be violated in the name of performance.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

Victor posted:

At least, where I work, sometimes we go without constraints, but have nightly integrity checks that run to ensure there are no orphaned records.
As I see it, you do have FK constraints - they're just deferred to a nightly check as opposed to an immediate check, for performance reasons. I think that's acceptable, as long as I could be sure that my own integrity checks were correct, and that my app wouldn't crash if it came across an orphaned record.
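
(A typical nightly orphan check is just a LEFT JOIN looking for misses - a sketch, assuming a child table whose parent_id should always point at a row in parent:)
code:
SELECT c.*
FROM child c
    LEFT JOIN parent p ON (p.parent_id = c.parent_id)
WHERE p.parent_id IS NULL;  -- any rows returned here are orphans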

Currently I'm not sure of either of those things in my own app, which is why I'd be reluctant to do it this way.

Xae, I see your point of view but you seem to be under the assumption that you can throw enough hardware and money at any scalability problem in order to fix it. I don't think that's always the case, and isn't always practical. I'd love it if I could get better performance out of my app just by putting it on bigger iron. :sigh:

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
That's exactly how you'd do it. That query would work fine.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

fletcher posted:

I understand the idea of separating logic in a model-view-controller system, but I don't understand the concept of separating the SQL with a templating system. I'm working in PHP; where can I find more info about how and why to do it?

What do you mean by SQL templates exactly? I can interpret what you said in 3 ways:

1) Making a parameterized SQL query, then using "prepare(query)" and "execute(query_handle, parameters)" to actually execute it.

2) A set of common-place SQL tasks which have already been written for you, you just fill in the blanks. Kind of like a sample web-site layout.

3) Abstracting persistent record information away from the main code by using some sort of Persistence Layer (also known as ActiveRecord).

Or did you mean something else?

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

jwnin posted:

I cannot agree more with this. I get uncomfortable with ORM after a certain load.

It really seems to me like there are two entirely different camps regarding databases:
1) Clean separation between the app tier and the database tiers. Enforced by denying access to tables and driving access through stored procs, views, etc, allowing for a highly secure design and the ability to change the physical data model with no impact to the application tier.
2) The ORM camp. I hate to be short sighted, but it just seems like a pattern used by developers who see the database as nothing more than a file system. I know this is not correct, but there is some truth to it.

We use ORM via a bespoke Persistence Layer mechanism, and for the most part it is great. It's so convenient to be able to write code like this:

php:
<?
$org_name = User::getByOID($user_id)->getOrganization()->getName();
?>
Internally, the first call will get the row for that user and convert it into an object. The second call, "->getOrganization()", will use some shadow info to retrieve the row for that user's organization and convert that into an object. And the final call, "->getName()", will return the name of that organization. So internally, 2 SQL calls are made to retrieve the User and Organization rows respectively.

This kind of code is very clear: there's no SQL involved, so it's more maintainable, there's less for the app developer to understand, and it's very easy to write.

But where it falls down is that it's very difficult to optimize without breaking the paradigm. Say I wanted to do this kind of thing in a loop:
php:
<?
foreach($user_id_array as $user_id) {
   $org_names[] = User::getByOID($user_id)->getOrganization()->getName();
   ...
}
?>
Internally that's still 2 SQL calls per user-id, which is hardly scalable. A SQL developer would know that the entire list could be fetched with just 1 query, but it's necessary to break out of the ORM system in order to do that.
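
(For comparison, the hand-written version is a single round trip - a sketch, since the real table and column names are hidden behind our Persistence Layer:)
code:
SELECT o.name
FROM users u
    JOIN organizations o ON (o.organization_id = u.organization_id)
WHERE u.user_id IN (100, 101, 102);  -- the contents of $user_id_array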

My experience seems to jibe with jwnin's above, in that some of my co-developers are only willing to use ORM and will go to any lengths to avoid writing SQL (because they're not very good at it), rather than seeing ORM as just a convenient tool that should be abandoned once performance gets ridiculously slow.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

fletcher posted:

Let's say you have products in one table and votes for the products in another table. How do you get the products and the total votes for each product?
code:
SELECT product_name, COUNT(*)
FROM products
   JOIN votes USING (product_id)
GROUP BY product_name

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

nbv4 posted:

I have a PHP script that builds a few hundred mysql queries, then executes them one by one. All of these queries are INSERT queries. The way I have it now, if one of those queries fails, it's a pain in the rear end to go through and fix whatever problem may have occurred. I want to have it so the system does a dry run through each query, and only if every single query is valid, it executes them all. Is there any way to test a query against the database? It's probably very simple, but Googling gives me nothing.
Why would a query fail? If a query fails, that's a sign that either your data or code is broken.

Do you mean "Don't fail if the record is already there?". Because to do that, you can always do:

code:
INSERT INTO foo (foo_id, bar)
SELECT $foo_id, $bar
WHERE NOT EXISTS (SELECT 1
   FROM foo
   WHERE foo_id = $foo_id
   )

minato fucked around with this message at 11:33 on Feb 7, 2008

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

nbv4 posted:

the user uploads a file that is transformed into about 600 queries. If one line of that data is not right for whatever reason, only 599 lines will get entered, and it'll be absolute hell to get things right. I need some way to ensure 100% of the queries are correct, or else it tells the user his data needs looking over.

You also sound like you might want to learn about transactions. That's where you run a "BEGIN" statement to take a snapshot of the database, run a bunch of SQL statements, and if any one of them fails you can "ROLLBACK" to the state of the database when you took the snapshot. If they all work and you're happy with it, you run "COMMIT" and it saves them all.
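
In raw SQL it looks something like this (a sketch with made-up tables and columns):
code:
BEGIN;
INSERT INTO items (name, quantity) VALUES ('widget', 5);
INSERT INTO items (name, quantity) VALUES ('gadget', 3);
-- ...and so on for the rest of the file.
-- If any statement failed, throw the whole lot away with ROLLBACK;
-- otherwise make it all permanent:
COMMIT;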

(Edit: in MySQL, transaction support depends on the table type - the default MyISAM tables don't support them, so you'd need to use InnoDB tables.)

But really, you should be validating all the data before it gets anywhere near the database, before you begin any inserts.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
I've been using Postgres for the last 6-7 years, and I can't fault it. It's extremely robust, it's fully-featured, and the documentation is excellent. I can't compare performance but all the benchmarks I've seen put it equal to (if not better than) MySQL. My mind boggled when I found out that MySQL only relatively recently introduced transactions.

There may be certain situations where MySQL is better, but I think most people would be very happy with Postgres.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

UltraRed posted:

Why don't I want to store passwords using an md5() or SHA1() hash? What's insecure about them that makes everyone suggest going elsewhere?
It's probably fine to use md5 or sha1, but salt the passwords first to help protect against rainbow table attacks.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

fletcher posted:

Is there a way I can validate a user id actually exists before I do something like relating a row in a different table to it?

Like when I'm going to insert a new photo for user id #5

INSERT INTO photo (filename, caption, owner) VALUES ("pic.jpg", "vacation", 5)

can I put something in the WHERE clause that will cause this query to fail if there is no user with id 5?
You want to put a "foreign key constraint" between the photos table and users table.

In PostgreSQL it looks something like this:
code:
ALTER TABLE photo ADD CONSTRAINT "user_exists"
FOREIGN KEY (owner) REFERENCES users (user_id)
ON UPDATE CASCADE
ON DELETE CASCADE;
If you attempt to insert a row into the photo table with an owner value that doesn't exist in the users table, the DB will throw an error.

The "ON UPDATE CASCADE" bit means that if the user ID gets updated to a different value, the photo table's user_id will be updated too.

The "ON DELETE CASCADE" bit means that if a user ever gets deleted from the user table, their rows in the photo table will be automatically removed too.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

Orbis Tertius posted:

I want to have a form that lets the user search a number of variables that are in several different tables and I want the resulting table to only display columns for the fields that were filled in on the form.

I'm having trouble figuring out how to do a SELECT query that will join only the necessary tables and search only the necessary variables based on what form fields the user fills in.
We have to deal with a similar problem. As the number of tables increased, the idea of having a single SQL statement became less appealing because we ended up with a huge, illegible statement. So ultimately we just ended up building the SQL on the fly.

This is not quite so simple, because it's necessary to keep track of which tables are required for the JOINs and SELECTs. Here's a very simple example, and it helps that we always use table "a" in the query:

php:
<?
$from_tables = array();
$selects     = array("a.*");   // always select from table "a"
$filters     = array("true");  // so the WHERE clause is never empty

if (!empty($_REQUEST['some_b_param'])) {
  $from_tables[] = "JOIN b ON (a.id = b.a_id)";
  $selects[]     = "b.*";
  $filters[]     = "b.foo = ".escape_sql($_REQUEST['some_b_param']);
}

if (!empty($_REQUEST['some_c_param'])) {
  $from_tables[] = "JOIN c ON (a.id = c.a_id)";
  $selects[]     = "c.*";
  $filters[]     = "c.bar = ".escape_sql($_REQUEST['some_c_param']);
}

$sql = "SELECT ".implode(", ", $selects)."
        FROM a ".implode(" ", $from_tables)."
        WHERE ".implode(" AND ", $filters);
?>

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

nbv4 posted:

I have a SQL problem that I can't even figure out where to begin. I need to get the date of the third from last occurrence of a record. Say this is the table:
Firstly, generate a SQL statement that shows the date, the items, and the running total of items on and after that date. (I've used ID as the date in this example.)
code:
SELECT id, items, (SELECT SUM(items)
                   FROM item AS item2
                   WHERE item2.id >= item.id) AS total
FROM item
ORDER BY item.id;

 id | items | total
----+-------+-------
  1 |     2 |     6
  2 |     0 |     4
  3 |     1 |     4
  4 |     1 |     3
  5 |     2 |     2
So you could add some restrictions to that clause to give you the one where the total is '3', and there's your 3rd to last record.

But that won't work in all cases: for example, if you wanted the 5th to last record (ID 1 in my example), you'll get nothing back because there's no row whose total is exactly 5. What you want is the row where the total is either the position you want, or failing that, the next number after it. So let's tweak the SQL again:

code:
SELECT id, items, (SELECT SUM(items)
                   FROM item AS item2
                   WHERE item2.id >= item.id) AS total
FROM item
WHERE (SELECT SUM(items)
       FROM item AS item2
       WHERE item2.id >= item.id) >= 3
ORDER BY total
LIMIT 1
Bingo.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
You might want to order by total, then by date, to avoid cases where there are some dates with 0 approaches.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
Isn't it as simple as:
code:
SELECT g.genre, p.publisher_name, o.owner_name
FROM genre AS g
   JOIN publisher AS p USING (genre_id)
   JOIN owner AS o ON (o.publisher_id = p.publisher_id)
ORDER BY g.genre, p.publisher_name, o.owner_name

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

fletcher posted:

Ah, that is what I needed. So what about the performance issue? It takes ~4s to run this query.
Make sure you have an index on both article.owner and user.id.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

Clobbersaurus posted:

I have kind of a general but easy newb question -

I'm looking to represent a large number of strings in a table. If I could compress them to ints (by a 1:1 hashing mechanism), would this improve searching time on the table? I've been thinking a lot about it (and I have very little SQL experience) but ultimately it comes down to: is a single varchar comparison slower than a single integer comparison? What if the strings were stored in fixed size char arrays?
Integer lookups are significantly faster than string lookups - but if your database engine's index system is worth its salt, it'll be hashing any strings into integers behind the scenes, so there's probably little point in doing it yourself.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
I can't think of a way to do it without a subquery:
code:
SELECT t.id, t.key, t.value, t.version
FROM t
    JOIN (SELECT key, MAX(version) AS max_version
          FROM t
          GROUP BY key) AS sub_t ON (t.key=sub_t.key AND t.version=sub_t.max_version)

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
"FROM foo JOIN bar ON (condition)" is equivalent to "FROM foo, bar WHERE (condition)". The main reason for JOIN's existence is to help separate the table relationships from the rest of the filtering conditions.

With regards to indexing, this is what the EXPLAIN statement is for (if you can interpret the often cryptic results). A general rule of thumb is that if the DB engine has to perform a sequential scan on a large table, then you probably want an index on the column(s) it's using.

Don't forget about multi-column indexes. E.g. if you have a table like "student_class (student_id, class_id, ...)" and you're looking up by both student_id and class_id simultaneously, then it may be faster to add a single index on (student_id, class_id) than to have two separate indexes on student_id and class_id.
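
E.g. (a sketch, using the table from above; the index name is made up):
code:
CREATE INDEX student_class_idx ON student_class (student_id, class_id);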

Adding indexes on ORDER BY columns generally hasn't helped at all in my experience.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

fletcher posted:

If you have a users table and an articles table, what do you guys name the column that links the article to the user? 'userId' or something more descriptive like 'authorId'?
Something more descriptive. Because you can be sure that eventually you'll have fields like "Created by", "Last updated by" and "Maintained by" that will all refer to user IDs, so you can't just use userId for those.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

fletcher posted:

Why does adding an index on one column speed up one SELECT query a lot while it slows down another SELECT query? Or should that not happen, and something else is causing it?

Indexes should only improve the speed of SELECTs (though they may have adverse effects on the speed of INSERTs, UPDATEs and DELETEs). If adding an index is slowing a SELECT down, then I would guess that the query planner is stupidly using the index when a sequential scan would be better (which is the case when the number of rows in the table is low). Not much you can do about that.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
Do not go for Option 1, ever. Horrible search performance, horrible update logic, no referential integrity, almost impossible to get aggregate statistics efficiently... the list goes on.

Option 2 is normalized. It's only unnormalized if you're considering having an integer tag_id to go with every tag name, and there may not be much point in that since it's just a 1-1 relationship. So I don't see too much difference between Option 2 and 3.

Are your tags case-sensitive? If not, you'll have to add "lower(tagname)" all over the place. This might affect indexing too - if your DB supports indexing on expressions, index on "lower(tagname)" rather than just tagname.
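
In Postgres the expression index is just (a sketch, assuming the tags live in a table called "tags"):
code:
CREATE INDEX tags_lower_tagname_idx ON tags (lower(tagname));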

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
It's kinda tricky, but it is possible. The way I'd do it is to assign each factor/item combination a rank, and then choose only those with a rank <= 5.

Let's say for now that you don't want to randomize items, and that instead you want the 5 latest items (ordered by the "date" field). Your SQL might look like this:

code:
SELECT factors.id, items.id, (
        SELECT COUNT(*) + 1
        FROM items AS i2
        WHERE i2.factor_id = factors.id
          AND i2.date > items.date
        ) AS rank
   FROM factors
   INNER JOIN items ON factors.id = items.factor_id
   WHERE rank <= 5
   ORDER BY factors.id, rank
The sub-select works out the ranking of each row (with the latest date as rank 1). The WHERE clause only selects the items with rank <= 5, so you only get the "top" 5 items per factor.

I don't know if MySQL allows you to re-use a SELECT alias in the WHERE clause; you might have to duplicate the sub-query in the WHERE clause where I've put "rank".

So that gets you 5 rows per factor, but they'll always be the same 5 rows. The next step is to randomize which rows in items are ranked. That's a tricky one, since SQL doesn't really work like that. You have no control over the order in which the SQL expressions are evaluated over the rows, and SQL generally relies on its expressions being deterministic (input of X gives output of Y no matter what point in time it's called). So I guess you really want a function such that, given an item and a random seed, it will hash the two together to generate a repeatable value that's "rankable", so you can draw out the top 5 items. Perhaps md5("random string" + item.id) might be enough (where "random string" is constant within a single query, but changes every time you make the SQL call).
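
So the ranking sub-select might end up looking something like this (completely untested; 'r4nd0m' stands for a fresh random string your app generates for each call):
code:
SELECT COUNT(*) + 1
FROM items AS i2
WHERE i2.factor_id = factors.id
  AND MD5(CONCAT('r4nd0m', i2.id)) < MD5(CONCAT('r4nd0m', items.id))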

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
That works for me in PostgreSQL. Are you sure you've spelt the column name correctly? If the column name is the same as a Postgres keyword, try putting it in quotes (e.g. "A.columnname").

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
What database are you using? If you're using Postgres you can use something like:
code:
SELECT u.user_name, ARRAY(SELECT f.name
                        FROM facilities f
                           JOIN user_facilities uf USING (facility_id)
                        WHERE uf.user_id=u.user_id)
FROM users u;

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
What's that FROM clause doing on line 6?

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
Use a sub-select?
code:
SELECT *
FROM `ip_group_city` 
   JOIN (SELECT `ip_addr`
         FROM `visitor_ips` 
         WHERE `time` >= DATE_SUB( CURDATE( ) , INTERVAL 1 DAY ) 
         ORDER BY RAND() LIMIT 100) AS vis_ips
WHERE `ip_start` <= INET_ATON(vis_ips.ip_addr) 
ORDER BY `ip_start` DESC LIMIT 1


minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
Use a sub-select, like
code:
UPDATE parents p
SET average = (SELECT AVG(pp.value)
               FROM relations pr
                   JOIN children pp ON (pp.id = pr.child_fk)
               WHERE p.id = pr.parent_fk)
WHERE p.value IS NULL;
(Not tested)

If you're using Postgres 8.4 you can use Common Table Expressions, but that might be overkill.
