Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
I feel like it should be a rule for this thread that if you ask a question, you should state what RDBMS you are using. They vary so much in some aspects that it's frequently crucial information.

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


Hammerite posted:

I feel like it should be a rule for this thread that if you ask a question, you should state what RDBMS you are using. They vary so much in some aspects that it's frequently crucial information.

I'm using Access 97 and I'm wondering if :smithicide:

FieryBalrog
Apr 7, 2010
Grimey Drawer
Yeah it's Oracle, sorry.

Stats exist for t_accounttree (histograms on all indexed columns). The problem is that stats don't exist for the intersection of accounttree and position, which has a different cardinality (per parent_id) than the accounttree table itself.

I also discovered that the way the histograms are structured for accounttree is leading the optimizer to lump together everything except the top 4 most popular values as "unpopular", and this binary way of handling the stats is compounding the issue. The top 4 parent_ids have 6400 associations each (they are the top level of the account tree hierarchy), and every other parent_id, whether it has 1250 associations or just 1, is considered to have an average cardinality of... 3 rows.

Edit: Unfortunately it looks like Oracle 11g doesn't support anything except frequency and height-based histograms. Can't use frequency because the number of distinct values is 25,000 (the bucket maximum is 254!), and the height-based histogram is basically useless if not outright misleading for a huge swathe of the data. 12c has hybrid and top-frequency histograms... unless anyone knows otherwise about this stuff being secretly available in 11g?
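
For reference, gathering the parent_id histogram with that 254-bucket ceiling looks roughly like this (just a sketch of what's already in place, not a fix):

code:
BEGIN
  -- SIZE 254 is the hard bucket limit; with ~25,000 distinct parent_ids
  -- Oracle falls back to the height-balanced histogram described above.
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'T_ACCOUNTTREE',
    method_opt => 'FOR COLUMNS PARENT_ID SIZE 254',
    cascade    => TRUE
  );
END;
/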

FieryBalrog fucked around with this message at 18:27 on Apr 25, 2014

Hadlock
Nov 9, 2004

How many GB should I plan on dedicating to a SQL Server 2014 VM that handles a 4GB database with, say, 4 simple queries a second?

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.
I have what I think amounts to a data warehousing question for SQL Server 2012 and I'm not sure how to approach it.

I have a table consisting of user data from Folding@home that I have been capturing once an hour for the last year.

Table UserData has the structure:
code:
TimeStamp             UserName  Score      WorkUnits  TeamNumber
2014-04-29 12:34:56   Agrikk    33219802   20824      33
So basically, UserData contains an hourly snapshot for each of 1.5 million Folding@Home users for the last year.

There is one primary key on (TimeStamp, UserName, TeamNumber).

What I'm trying to figure out is how to build the underlying data structures that will be accessed by a PHP web server to produce a page like this one, which shows production data by user.

If I create a summary table that contains a single data point per user per day, I still have hundreds of millions of rows of data that need to be parsed to create a graph, and I just don't understand how to build a database that can return a query at web-browsing speeds.

I mean, the referenced page returns a query in less than a second, and I just don't understand how that is possible.

Erwin
Feb 17, 2006

Hadlock posted:

How many GB should I plan on dedicating to a SQL Server 2014 VM that handles a 4GB database with, say, 4 simple queries a second?

Which edition? If it's Express, 1GB. That's all it can use anyway. If it's anything else, up to 4GB. At that point, your whole database is cached in memory. You'd probably be fine with 1GB in either case since usage would be so low, especially if the queries are the same ones over and over. Those figures are on top of the operating system overhead, so say 2 and 5 respectively? You can always start low, and watch how often it goes to disk.
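
If you give the VM more than the instance should actually use, it's also worth capping SQL Server explicitly so it leaves room for the OS; something like this (1024 is just an illustrative figure, not a recommendation):

code:
-- Cap the buffer pool so SQL Server leaves memory for the OS.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 1024;
RECONFIGURE;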

Agrikk posted:

I have what I think amounts to a data warehousing question for SQL Server 2012 and I'm not sure how to approach it.


Do you have the correct indexes?

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb

Agrikk posted:

I have what I think amounts to a data warehousing question for SQL Server 2012 and I'm not sure how to approach it.

I have a table consisting of user data from Folding@home that I have been capturing once an hour for the last year.

Table UserData has the structure:
code:
TimeStamp             UserName  Score      WorkUnits  TeamNumber
2014-04-29 12:34:56   Agrikk    33219802   20824      33
So basically, UserData contains an hourly snapshot for each of 1.5 million Folding@Home users for the last year.

There is one primary key on (TimeStamp, UserName, TeamNumber).

What I'm trying to figure out is how to build the underlying data structures that will be accessed by a PHP web server to produce a page like this one, which shows production data by user.

If I create a summary table that contains a single data point per user per day, I still have hundreds of millions of rows of data that need to be parsed to create a graph, and I just don't understand how to build a database that can return a query at web-browsing speeds.

I mean, the referenced page returns a query in less than a second, and I just don't understand how that is possible.

Does it have to generate it in real time on demand? Maybe just generate the summary report periodically with a cron job and serve up the generated report.
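
Something like this run nightly would do the rollup (untested sketch; DailySummary is a made-up table, columns taken from your post):

code:
-- One row per user per day, built from yesterday's hourly snapshots.
INSERT INTO DailySummary (SummaryDate, UserName, TeamNumber, Score, WorkUnits)
SELECT CAST([TimeStamp] AS DATE), UserName, TeamNumber, MAX(Score), MAX(WorkUnits)
FROM UserData
WHERE [TimeStamp] >= DATEADD(DAY, -1, CAST(GETDATE() AS DATE))
  AND [TimeStamp] <  CAST(GETDATE() AS DATE)
GROUP BY CAST([TimeStamp] AS DATE), UserName, TeamNumber;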

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

fletcher posted:

Does it have to generate it in real time on demand? Maybe just generate the summary report periodically with a cron job and serve up the generated report.

It needs to do it on demand: a user arrives on the site, looks up their own ID, and the report is generated.

Tad Naff
Jul 8, 2004

I told you you'd be sorry buying an emoticon, but no, you were hung over. Well look at you now. It's not catching on at all!
:backtowork:
Hi fellas, I have a query that (I hope) could be faster:

code:
SELECT 
  I.`item_id`,
  I.`title`,
  I.`author`,
  I.`bibdata`,
  I.`callnumber`,
  I.`uri`,
  I.`hash`,
  I.`physical_format`,  
  GROUP_CONCAT(C.`course_id`, '\x1f', C.`lmsid` SEPARATOR '\x1e') AS lmsids 
FROM 
  `item` I 
  JOIN `course_item` CI USING(`item_id`) 
  JOIN `course` C USING(`course_id`) 
WHERE 
  ( I.`title` LIKE '%searchterm%' OR I.`bibdata` LIKE '%searchterm%' OR I.`author` LIKE '%searchterm%' ) 
GROUP BY 
  I.`item_id`
Using MySQL 5.5. bibdata is TEXT, others are VARCHAR or BIGINT as appropriate. All the tables are InnoDB. I already tried breaking bibdata out into a MyISAM table with fulltext indexing and using MATCH, but that actually turned out to be slower. I suppose what's killing it is the leading '%'s in the LIKEs, but I can't think of how to avoid that.

krnhotwings
May 7, 2009
Grimey Drawer

FeloniousDrunk posted:

Using MySQL 5.5. bibdata is TEXT, others are VARCHAR or BIGINT as appropriate. All the tables are InnoDB. I already tried breaking bibdata out into a MyISAM table with fulltext indexing and using MATCH, but that actually turned out to be slower. I suppose what's killing it is the leading '%'s in the LIKEs, but I can't think of how to avoid that.
As far as I understand it, joining on the separate indexed table for bibdata would be slower because you'd be doing an indexed filter on top of a full table scan rather than just doing the full table scan outright.

You could convert the items table to MyISAM with fulltext indexes and searching on bibdata, title, and author:

code:
SELECT 
  I.`item_id`,
  I.`title`,
  I.`author`,
  I.`bibdata`,
  I.`callnumber`,
  I.`uri`,
  I.`hash`,
  I.`physical_format`,  
  GROUP_CONCAT(C.`course_id`, '\x1f', C.`lmsid` SEPARATOR '\x1e') AS lmsids 
FROM 
  `item` I 
  JOIN `course_item` CI USING(`item_id`) 
  JOIN `course` C USING(`course_id`) 
WHERE MATCH(I.title, I.bibdata, I.author) AGAINST('searchterm')
GROUP BY 
  I.`item_id`
Any reason you're not using MyISAM? 'cause I think you're SOL otherwise. If you're trying to avoid MyISAM, you could upgrade to 5.6 :v: (5.6 has fulltext indexes and searching for InnoDB)
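
Setting that up would look something like this (untested; the index name is made up, and MATCH() needs a fulltext index over exactly those three columns):

code:
-- Convert the table, then add the fulltext index the MATCH() above relies on.
ALTER TABLE `item` ENGINE=MyISAM;
ALTER TABLE `item` ADD FULLTEXT INDEX `ft_item_search` (`title`, `bibdata`, `author`);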

Edit: Earlier, I posted a pretty horrible suggestion, which I think is best if it's nuked away (it involved union selects for each LIKE and some other whacked poo poo.)

krnhotwings fucked around with this message at 17:07 on Apr 30, 2014

Tad Naff
Jul 8, 2004

I told you you'd be sorry buying an emoticon, but no, you were hung over. Well look at you now. It's not catching on at all!
:backtowork:

krnhotwings posted:

As far as I understand it, joining on the separate indexed table for bibdata would be slower because you'd be doing an indexed filter on top of a full table scan rather than just doing the full table scan outright.

You could convert the items table to MyISAM with fulltext indexes and searching on bibdata, title, and author:

code:
SELECT 
  I.`item_id`,
  I.`title`,
  I.`author`,
  I.`bibdata`,
  I.`callnumber`,
  I.`uri`,
  I.`hash`,
  I.`physical_format`,  
  GROUP_CONCAT(C.`course_id`, '\x1f', C.`lmsid` SEPARATOR '\x1e') AS lmsids 
FROM 
  `item` I 
  JOIN `course_item` CI USING(`item_id`) 
  JOIN `course` C USING(`course_id`) 
WHERE MATCH(I.title, I.bibdata, I.author) AGAINST('searchterm')
GROUP BY 
  I.`item_id`
Any reason you're not using MyISAM? 'cause I think you're SOL otherwise. If you're trying to avoid MyISAM, you could upgrade to 5.6 :v: (5.6 has fulltext indexes and searching for InnoDB)

Edit: Earlier, I posted a pretty horrible suggestion, which I think is best if it's nuked away (it involved union selects for each LIKE and some other whacked poo poo.)

Sneaky, I was going to reply to your previous post. InnoDB for transactions, which is kind of a necessity. The DB server is running Ubuntu 12.04.4, MySQL 5.5 is as high as that goes at the moment (I suppose ever, now that 14.04 is out) and I'd rather not compile my own. I did consider it, but the app I'm building is being considered for sharing with other places and that's not something I'd want to support. Maybe once 5.6 gets into standard repos.

Fortunately the query is still less than a second and it's only relevant in the administrative functions, not the student-facing part.

Tad Naff fucked around with this message at 20:08 on Apr 30, 2014

krnhotwings
May 7, 2009
Grimey Drawer

FeloniousDrunk posted:

Sneaky, I was going to reply to your previous post. InnoDB for transactions, which is kind of a necessity. The DB server is running Ubuntu 12.04.4, MySQL 5.5 is as high as that goes at the moment (I suppose ever, now that 14.04 is out) and I'd rather not compile my own. I did consider it, but the app I'm building is being considered for sharing with other places and that's not something I'd want to support. Maybe once 5.6 gets into standard repos.

Fortunately the query is still less than a second and it's only relevant in the administrative functions, not the student-facing part.
:ninja:

I figured transactions would be the showstopper (the usual..) Have you considered using Percona Server 5.6? They have apt repos. Coincidentally, I'm going to migrate my group's primary db server from Ubuntu 10.04, MySQL 5.1 to Ubuntu 12.04, Percona Server 5.6 (waiting on 14.04 for any potential issues.)

Given the technical limitations of the query, I would've said "gently caress it" and left it as is if it ran under a second.

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

Agrikk posted:

I have what I think amounts to a data warehousing question for SQL Server 2012 and I'm not sure how to approach it.

I have a table consisting of user data from Folding@home that I have been capturing once an hour for the last year.

Table UserData has the structure:
code:
TimeStamp             UserName  Score      WorkUnits  TeamNumber
2014-04-29 12:34:56   Agrikk    33219802   20824      33
So basically, UserData contains an hourly snapshot for each of 1.5 million Folding@Home users for the last year.

There is one primary key on (TimeStamp, UserName, TeamNumber).

What I'm trying to figure out is how to build the underlying data structures that will be accessed by a PHP web server to produce a page like this one, which shows production data by user.

If I create a summary table that contains a single data point per user per day, I still have hundreds of millions of rows of data that need to be parsed to create a graph, and I just don't understand how to build a database that can return a query at web-browsing speeds.

I mean, the referenced page returns a query in less than a second, and I just don't understand how that is possible.

Any help with this?

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

Agrikk posted:

Any help with this?
How long does it take to retrieve 8760 rows when this table is indexed on user? It seems there's some confusion regarding data scope here; while you have millions of rows overall (not a lot, by the way), there are only 10k hours per year, and your output resolution is likely only 5% of that.
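
For concreteness, an index along these lines (SQL Server syntax, column names taken from your post) should turn that per-user fetch into an index seek plus a short range scan:

code:
-- All of a user's rows come back in time order straight from the index.
CREATE NONCLUSTERED INDEX IX_UserData_User_Time
    ON UserData (UserName, [TimeStamp])
    INCLUDE (Score, WorkUnits, TeamNumber);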

You won't gain much from a summary table because it's only a factor of 20 in overall data size and the underlying database query for a user should be fast enough. If you had per-minute data, it would be a different story. Likewise, your needs may only be for a round robin data set.

In short, they've optimized to create that graph quickly. Are you just trying to duplicate that task, or something else, or with other features, etcetera?

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!
Postgres: One of my tables has gotten to 20mil rows and is adding them at a rate of about 2mil/week. The rate is accelerating. Lookups are still blazing fast for the most part since the table still fits in memory, and the typical query by primary key serial id will return 200-500 rows in no time.

I hear however that even for an add-only table things are about to get pretty lovely. Vacuuming (which right now autovacuum never does since I haven't rolled over the transaction counter and it's add-only) might start taking a week on a 500mil rows table etc.

I'm told that date-based partitioning is something worth looking into, and it fits my use-case pretty reasonably. Any other recommendations for once you get big-boy tables?

Xae
Jan 19, 2005

DreadCthulhu posted:

Postgres: One of my tables has gotten to 20mil rows and is adding them at a rate of about 2mil/week. The rate is accelerating. Lookups are still blazing fast for the most part since the table still fits in memory, and the typical query by primary key serial id will return 200-500 rows in no time.

I hear however that even for an add-only table things are about to get pretty lovely. Vacuuming (which right now autovacuum never does since I haven't rolled over the transaction counter and it's add-only) might start taking a week on a 500mil rows table etc.

I'm told that date-based partitioning is something worth looking into, and it fits my use-case pretty reasonably. Any other recommendations for once you get big-boy tables?

1) Have a good Indexing Strategy
2) Have a good Querying Strategy
3) Don't gently caress with settings unless you have to


If you can access the disk/hardware directly, split the indexes onto a different drive than the data. I'm not sure about Postgres, but this can help in Oracle and SQL Server.

If you're IO bound look at compression. If you're CPU bound (you probably aren't) don't.

Partitions are good for deleting data en masse, but they can be a pain in the rear end to manage. Most DB engines require you to list all partitions explicitly, which is a pain in the rear end to set up initially and bites you if you forget to add a new year's partition.


Edit: To explain #3 - don't try to outsmart the database or trick it. Most clever solutions I have seen for "fixing" large tables end up causing a ton more problems than they solve.

Xae fucked around with this message at 06:18 on May 7, 2014

Cast_No_Shadow
Jun 8, 2010

The Republic of Luna Equestria is a huge, socially progressive nation, notable for its punitive income tax rates. Its compassionate, cynical population of 714m are ruled with an iron fist by the dictatorship government, which ensures that no-one outside the party gets too rich.

Due to reasons I've got ~4 weeks to learn as much as humanly possible from a DB point of view (inside a corporate environment), both design and implementation.

I currently do dev work that has little to do with DB's beyond writing the odd SQL statement to pull\edit\insert some data.

Assume I know nothing beyond very basic Joins, SQL statements and what a relational database is.

Can anyone give me a list of things to read\resources that are helpful.

Janitor Prime
Jan 22, 2004

PC LOAD LETTER

What da fuck does that mean

Fun Shoe

DreadCthulhu posted:

Postgres: One of my tables has gotten to 20mil rows and is adding them at a rate of about 2mil/week. The rate is accelerating. Lookups are still blazing fast for the most part since the table still fits in memory, and the typical query by primary key serial id will return 200-500 rows in no time.

I hear however that even for an add-only table things are about to get pretty lovely. Vacuuming (which right now autovacuum never does since I haven't rolled over the transaction counter and it's add-only) might start taking a week on a 500mil rows table etc.

I'm told that date-based partitioning is something worth looking into, and it fits my use-case pretty reasonably. Any other recommendations for once you get big-boy tables?

We were in a similar position, but we just finished implementing a partitioning strategy. It's worked out great so far; you just have to be careful to always include the column used for your partition trigger in all your queries, otherwise it has to do a table scan across all the partitions!

We only ever need to keep 90 days worth of data and didn't have that much disk space to keep all the data anyway, so we implemented a 14 week rolling table strategy. The whole process is controlled by a python script that runs every Monday. It starts by dumping the oldest partition and uploading it to Amazon S3 as our backup. We can quickly restore this partition at any time in the future using the COPY command. Then it deletes the oldest partition, creates a new partition for next week and updates the partition trigger to remove the reference to the old partition and add the new one. One note: creating a partition will not inherit the indexes or any of the foreign keys, so we use pg_catalog and friends to get the metadata from another partition and apply it to the new partition.

I should mention that we calculate daily aggregate statistics from these tables so there's really no need to even keep 14 weeks of data, but it's useful in case anyone needs to query the raw data instead of the summary statistics.
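
In SQL terms the weekly rotation boils down to something like this (a simplified sketch with made-up table and column names; the real script also copies indexes and foreign keys from an existing partition as described above):

code:
-- New child partition for the coming week.
CREATE TABLE events_2014_w20 (
    CHECK (created_at >= DATE '2014-05-12' AND created_at < DATE '2014-05-19')
) INHERITS (events);

-- Drop the oldest partition once it has been dumped and shipped to S3.
DROP TABLE events_2014_w06;

-- Rebuild the routing trigger function so inserts land in the right child.
CREATE OR REPLACE FUNCTION events_insert_trigger() RETURNS trigger AS $$
BEGIN
    IF NEW.created_at >= DATE '2014-05-12' AND NEW.created_at < DATE '2014-05-19' THEN
        INSERT INTO events_2014_w20 VALUES (NEW.*);
    -- ... one branch per live partition ...
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;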

Chill Callahan
Nov 14, 2012

Cast_No_Shadow posted:

Due to reasons I've got ~4 weeks to learn as much as humanly possible from a DB point of view (inside a corporate environment), both design and implementation.

I currently do dev work that has little to do with DB's beyond writing the odd SQL statement to pull\edit\insert some data.

Assume I know nothing beyond very basic Joins, SQL statements and what a relational database is.

Can anyone give me a list of things to read\resources that are helpful.

Knowing that database tables and indexes are basically B+ trees is very helpful. I don't know a good resource on normalizing databases, but read something on that too. Also, this is really good for indexes: http://use-the-index-luke.com/sql/preface

raej
Sep 25, 2003

"Being drunk is the worst feeling of all. Except for all those other feelings."
Is there a quick way to select string before another string?

Example:
"Stone Brewing Company"
"Salmon Creek Brewing"
"Salara Brewery"

I want to only keep the text before the string " brew*"

Result:
"Stone"
"Salmon Creek"
"Salara"

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb
Something like this maybe?

code:
SELECT SUBSTRING('Stone Brewing Company', 1, LOCATE('brew', 'Stone Brewing Company')-1);

raej
Sep 25, 2003

"Being drunk is the worst feeling of all. Except for all those other feelings."

fletcher posted:

Something like this maybe?

code:
SELECT SUBSTRING('Stone Brewing Company', 1, LOCATE('brew', 'Stone Brewing Company')-1);

Everyone is going to rag on me for using Access for this (it's a csv import that needs cleaning up and it's quick and dirty)

But I'm getting errors with:
code:
SELECT breweries.[name], SUBSTRING(breweries.[name], 1, LOCATE(' brew', breweries.[name])-1)
FROM breweries;

Sir Bobert Fishbone
Jan 16, 2006

Beebort

fletcher posted:

Something like this maybe?

code:
SELECT SUBSTRING('Stone Brewing Company', 1, LOCATE('brew', 'Stone Brewing Company')-1);

Yep, or in MSSQL:

code:
select SUBSTRING('Stone Brewing Company', 0, CHARINDEX('Brew', 'Stone Brewing Company')-1)
e:

or try this in access:

code:
SELECT breweries.[name], MID(breweries.[name], 1, INSTR(1, breweries.[name], 'brew')-1)
FROM breweries;

Sir Bobert Fishbone fucked around with this message at 20:37 on May 7, 2014

gariig
Dec 31, 2004
Beaten into submission by my fiance
Pillbug

Cast_No_Shadow posted:

Due to reasons I've got ~4 weeks to learn as much as humanly possible from a DB point of view (inside a corporate environment), both design and implementation.

I currently do dev work that has little to do with DB's beyond writing the odd SQL statement to pull\edit\insert some data.

Assume I know nothing beyond very basic Joins, SQL statements and what a relational database is.

Can anyone give me a list of things to read\resources that are helpful.

I think the general term for this is "accidental DBA": the developer ends up being the most (yet still least) qualified person to run a SQL Server instance, because a database uses a computer and developers use computers, ergo developers are DBAs.

You should post which database you're using so you can get better responses.

raej
Sep 25, 2003

"Being drunk is the worst feeling of all. Except for all those other feelings."

Sir Bobert Fishbone posted:



or try this in access:

code:
SELECT breweries.[name], MID(breweries.[name], 1, INSTR(1, breweries.[name], 'brew')-1)
FROM breweries;

Yes! Thank you!

Oh great SQL guru, can you help me with this last step then?

Two tables: Beers, Breweries (Simplified for the example)

Breweries
id,name

Beers
id,name,breweryid

A lot of the beers include part of the brewery name in the beer name. I want that removed.

code:
breweries

id|name                           
------------------------------
 1|Stone Brewing Company
 2|Stone City Brewing Company
 3|Peoples Pint
 4|Boston Beer Company
code:
beers

id|name                          |breweryid
-------------------------------------------
 1|Stone Smoked Porter           |1
 2|Stone City Iowa Pale Ale      |2
 3|Peoples Pint Brakeshoe Porter |3
 4|Sam Adams Blueberry Hill Lager|4
So what I'm seeing now when I combine a beer with its brewery are things like "Stone Brewing Company Stone Smoked Porter" which is a bit cumbersome.

Solution: remove the stripped-down brewery name from the beer name. The above SQL will get the stripped-down brewery name, and I'm thinking there's a reverse version of it that can produce the beer name with the stripped-down brewery name removed.

Logically, it would be something like:
For each beer, select everything after stripped down brewery name for this beer's brewery

raej fucked around with this message at 21:15 on May 7, 2014

EAT THE EGGS RICOLA
May 29, 2008

raej posted:

Everyone is going to rag on me for using Access for this (it's a csv import that needs cleaning up and it's quick and dirty)

But I'm getting errors with:
code:
SELECT breweries.[name], SUBSTRING(breweries.[name], 1, LOCATE(' brew', breweries.[name])-1)
FROM breweries;

If this is a one off and it's relatively huge, use Open Refine.

raej
Sep 25, 2003

"Being drunk is the worst feeling of all. Except for all those other feelings."
Checking out Open Refine. The sheer amount of data may make this difficult though. Breweries has over 14k entries, and Beers has over 185k entries. Doing anything by hand is not an option.

EAT THE EGGS RICOLA
May 29, 2008

raej posted:

Checking out Open Refine. The sheer amount of data may make this difficult though. Breweries has over 14k entries, and Beers has over 185k entries. Doing anything by hand is not an option.

That's a very small dataset for open refine! Good luck!

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!

There are several themes I've heard while doing my research on this topic:

- figure out a way to separate your system into OLTP and OLAP: the OLAP side is where you roll up your raw data asap and can tolerate slower, infrequent queries (and super cheap, slow storage), while the OLTP side is where you keep the more frequently accessed raw data and other stuff you don't need in the analytics portion of the system. Basically caching all over again.
- partition the larger tables by something like date or id and allow for constrained queries. Either drop or archive stuff you don't need, or move it to the slower analytics portion of the system. Use partman to automate the whole thing.
- invest in fancier hardware to buy yourself more time

Never heard of the compression thing, that's interesting, something to look into.

DreadCthulhu fucked around with this message at 22:20 on May 7, 2014

Sir Bobert Fishbone
Jan 16, 2006

Beebort

raej posted:

Yes! Thank you!

Oh great SQL guru, can you help me with this last step then?

Two tables: Beers, Breweries (Simplified for the example)

Breweries
id,name

Beers
id,name,breweryid

A lot of the beers include part of the brewery name in the beer name. I want that removed.

code:
breweries

id|name                           
------------------------------
 1|Stone Brewing Company
 2|Stone City Brewing Company
 3|Peoples Pint
 4|Boston Beer Company
code:
beers

id|name                          |breweryid
-------------------------------------------
 1|Stone Smoked Porter           |1
 2|Stone City Iowa Pale Ale      |2
 3|Peoples Pint Brakeshoe Porter |3
 4|Sam Adams Blueberry Hill Lager|4
So what I'm seeing now when I combine a beer with its brewery are things like "Stone Brewing Company Stone Smoked Porter" which is a bit cumbersome.

Solution: remove the stripped-down brewery name from the beer name. The above SQL will get the stripped-down brewery name, and I'm thinking there's a reverse version of it that can produce the beer name with the stripped-down brewery name removed.

Logically, it would be something like:
For each beer, select everything after stripped down brewery name for this beer's brewery

I don't have Access in front of me, but could you try something like

code:
SELECT breweries.[name], MID(breweries.[name], 1, INSTR(1, breweries.[name], 'brew')-1), MID(beers.[name], LEN(MID(breweries.[name], 1, INSTR(1, breweries.[name], 'brew')-1)), LEN(beers.[name]))
FROM beers INNER JOIN breweries on breweries.id = beers.breweryid;
?

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

DreadCthulhu posted:

Postgres: One of my tables has gotten to 20mil rows and is adding them at a rate of about 2mil/week. The rate is accelerating. Lookups are still blazing fast for the most part since the table still fits in memory, and the typical query by primary key serial id will return 200-500 rows in no time.

I hear however that even for an add-only table things are about to get pretty lovely. Vacuuming (which right now autovacuum never does since I haven't rolled over the transaction counter and it's add-only) might start taking a week on a 500mil rows table etc.

I'm told that date-based partitioning is something worth looking into, and it fits my use-case pretty reasonably. Any other recommendations for once you get big-boy tables?

Yeah, you want to start by turning off that trigger function called ON INSERT insert_exponential_growth. :razz:

So, first point: Twenty million rows isn't nothing, but it's also not too much. I'd be more concerned with "twenty million rows of what?", because 20M*(id,name) is nothing, whereas 20M rows with a bunch of triggered foreign key inserts is quite another story. If you're only doing inserts (and, presumably, queries), then, as you've identified, your operational limit tends to be related to total memory. You alleviate those immediate concerns with:

Xae posted:

1) Have a good Indexing Strategy
2) Have a good Querying Strategy
3) Don't gently caress with settings unless you have to

Second point, where are you hearing the system breaks down after 20M rows? Where have you heard that a vacuum on this many rows will take days or weeks? Surely you already know what VACUUM does, oh friendly DreadCthulhu, as I know you're a good study of PG, but it sounds like you're forgetting the stated purpose of vacuum: "VACUUM reclaims storage occupied by dead tuples." So where are your dead tuples? Why would you bother running VACUUM FULL? Or are you thinking of using CLUSTER to improve query performance? How long does VACUUM take right now?

Third, you get the paycheck because you're the one with the information needed to solve the problem. ;) How has your performance degraded with increased inserts? What are your performance needs for queries versus inserts; i.e., which is more important? Are you offloading queries via replication? Can replication keep up or are you running out of segments? Which limit will you hit first: Disk storage, disk IO, CPU saturation, memory exhaustion, or network IO? What do the stats look like on your indexes? Are your queries doing the right thing or can they be optimized?

How have your backups been behaving? When will you hit an issue with your expected backup schedule? What happens for failover?

If you intend to do silly things like field compression (don't, you need a better reason for this crap than you would for changing table parameters from defaults with this design), how do you expect that to affect your queries? Index lookups and page hits are likely faster than decompression for any data that would conceivably benefit from it. If it screws up your data model, then it's likely even more time invested for reasonably small gains. (There are, as always, exceptions.)

Partitions can be reasonable, but it depends greatly on your partitioning scheme. If, for example, you partition out monthlies, are you suggesting that you don't refer to that data in queries? Or do you only select data within given months? If you only run queries against recent data, and are only proposing to keep partitions for historical purposes, why aren't you using a partial index (that fits in memory) instead of all the hassle of partitioning? If you do plan to partition, will it affect your uptime, need for more maintenance windows, backup strategies, and so forth?
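
The partial-index option, for illustration (made-up table and column names):

code:
-- Index only the recent slice you actually query; the index stays small
-- enough to live in memory while the table keeps growing underneath it.
CREATE INDEX events_recent_idx
    ON events (account_id, created_at)
    WHERE created_at >= DATE '2014-01-01';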

It's good that you consider this stuff ahead of time, but don't over-inflate others' notions of "severe difficulties". I've seen 200GB data sets running 50GB of updates daily on systems with 8GB of memory; >99% insert systems with 2B rows running smoothly except for some backup hiccups; and 2GB data sets running 10MB of daily traffic operating at a complete loving crawl that required daily VACUUM FULLs. I'll give you a good guess which of the three generated the most customer complaints about hardware+virtual performance; I'll also let you guess which of the three had the worst queries, lack of consideration for proper indexes and replication, and was otherwise pretty much just mishandled by the customer.

Wicaeed
Feb 8, 2005
I can't believe I'm being asked to do this since:

#1 I'm not a DBA (Seriously I don't know what a DB Table is/means :( )
#2 We have DBA's on staff that can/should be doing this

But alas, they neither know nor care.

Company is in the process of rebuilding our Billing environment. Currently we have a system where one billing db server acts as a primary, replicating any changes to a secondary node. If the primary node fails, the secondary picks up where the other left off. This current system is running MSSQL 2005 (I think).

They want a system that is 100% up all the time :rolleyes:

To my uneducated self, that means a Windows Server Failover Cluster (WSFC) running SQL server. This is where it gets kind of confusing.

What is the difference between running a simple WSFC w/SQL Server (Active/Passive node) vs. running two SQL servers in an Always-On high availability group vs. running a WSFC w/SQL Server in an Always-On high availability group?

A really broad question, I know :(

raej
Sep 25, 2003

"Being drunk is the worst feeling of all. Except for all those other feelings."

Sir Bobert Fishbone posted:

I don't have Access in front of me, but could you try something like

code:
SELECT breweries.[name], MID(breweries.[name], 1, INSTR(1, breweries.[name], 'brew')-1), MID(beers.[name], LEN(MID(breweries.[name], 1, INSTR(1, breweries.[name], 'brew')-1)), LEN(beers.[name]))
FROM beers INNER JOIN breweries on breweries.id = beers.breweryid;
?

code:
SELECT beers.[name], 
breweries.[name], 
MID(breweries.[name], 1, INSTR(1, breweries.[name], 'brew')-1) AS brewerysmall,
MID(beers.[name], LEN(MID(breweries.[name], 1, INSTR(1, breweries.[name], 'brew')-1)), LEN(beers.[name])) AS beersmall
FROM beers, breweries
WHERE breweries.[id] = beers.[breweryid]
AND breweries.[name] LIKE "* brew*"
;
This is the cleaned up version. Unfortunately, since not all of the beer names start with the brewery name, parts of beer names are taken off that don't need to be.

How would I add validation like "If beer.[name] starts with brewerysmall"?

I tried adding "AND beers.[name] LIKE MID(breweries.[name], 1, INSTR(1, breweries.[name], 'brew')-1)" in the WHERE clause, but nothing returns.

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb
Trying to do everything with database queries seems like it will be pretty cumbersome. I would just write a little python program that uses a sqlite database or something to massage your data.

Sir Bobert Fishbone
Jan 16, 2006

Beebort
To use LIKE, you might have to do something with wildcards like

code:
 AND beers.[name] LIKE '*' & MID(breweries.[name], 1, INSTR(1, breweries.[name], 'brew')-1) & '*'
or else it'll only match the literal string

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

Erwin posted:

Do you have the correct indexes?

Turns out I didn't. :bonk: After adding the right indexes, query times dropped from minutes to sub-seconds.

Fuck them
Jan 21, 2011

and their bullshit
:yotj:
I've got a fun update to do - no, I can't MERGE anything, since the DB I'm to touch is running in 2005 compatibility mode.

I'm updating some $DOCUMENT codes to reflect changes to a lookup table that the $DOCUMENT codes point to, using a temporary table I imported to know which ones to mess with. There are some in the temp table that do not match with the existing table, so that makes me think hmm, MERGE. Nope.

When I get home I'll flesh it out more. I'm used to writing baby joins on normalized data, actually doing this kind of stuff is new, and I want to learn to do it the right way.

Any good strategies to picking this apart?

fankey
Aug 31, 2001

Database 101 question. What would be a good way to organize a music library database ( SQLite ) assuming each song had (file,genre,artist,album,track,year). I've started reading up on databases and apparently it's important to normalize your database or it will all collapse in a fiery ball. I'm wondering how far to take things. It seems like I would want to have a (genre_id,genre) table and possibly a (artist_id,artist) one as well? That would make the main table (file,genre_id,artist_id,album,track,year)? Is there a reason to go crazy and also have a (album_id,album) table as well?

I won't be editing the database from my application if that matters.

karms
Jan 22, 2006

by Nyc_Tattoo
Yam Slacker

fankey posted:

Database 101 question. What would be a good way to organize a music library database ( SQLite ) assuming each song had (file,genre,artist,album,track,year). I've started reading up on databases and apparently it's important to normalize your database or it will all collapse in a fiery ball. I'm wondering how far to take things. It seems like I would want to have a (genre_id,genre) table and possibly a (artist_id,artist) one as well? That would make the main table (file,genre_id,artist_id,album,track,year)? Is there a reason to go crazy and also have a (album_id,album) table as well?

I won't be editing the database from my application if that matters.

Yep, you want an albums table as well; that gives you the most flexibility when you want to filter the data in any way.

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


KARMA! posted:

Yep, you want an albums table as well, so this gives you the most flexibility when you want to filter the data in any way.

I'd also argue for the inclusion of an artist-album intermediate table, or hell: artist-song, so you could capture collaborative efforts. Sure, it's a pain in the rear end, but depending on the level of detail you want in your searches, it is the best way to find your copy of We Are The World when looking for Bob Geldof tracks. :v:
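
Putting the last few posts together, a minimal SQLite sketch (names are illustrative) might look like:

code:
CREATE TABLE genre  (genre_id  INTEGER PRIMARY KEY, name  TEXT NOT NULL);
CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
CREATE TABLE album  (album_id  INTEGER PRIMARY KEY, title TEXT NOT NULL, year INTEGER);

CREATE TABLE song (
    song_id  INTEGER PRIMARY KEY,
    file     TEXT NOT NULL,
    title    TEXT,
    track    INTEGER,
    genre_id INTEGER REFERENCES genre(genre_id),
    album_id INTEGER REFERENCES album(album_id)
);

-- Junction table so a song can credit several artists (collaborations).
CREATE TABLE song_artist (
    song_id   INTEGER REFERENCES song(song_id),
    artist_id INTEGER REFERENCES artist(artist_id),
    PRIMARY KEY (song_id, artist_id)
);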

  • Reply