Janitor Prime
Jan 22, 2004

PC LOAD LETTER

What da fuck does that mean

Fun Shoe

DreadCthulhu posted:

I think my question was mostly about building parameterized SQL queries procedurally in the application layer, sorry for the confusion.

He should be the one apologizing, that post made no sense.

Also #1, it's because it makes it easier to see which columns are being used to join the different tables, instead of trying to guess it from the order in the WHERE clause. Also, if the column has the same name in both tables you can use the USING keyword instead of the ON keyword as a shortcut.
#2 I don't personally believe in, I've yet to see a convincing argument.
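
For illustration, a minimal sketch of the two spellings, using made-up orders/customers tables (note that USING isn't available in every engine; SQL Server, for one, doesn't support it):

SQL code:
-- Explicit join condition with ON
SELECT o.id, c.name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;

-- USING shortcut when the join column has the same name in both tables
SELECT o.id, c.name
FROM orders o
JOIN customers c USING (customer_id);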

Janitor Prime fucked around with this message at 06:14 on Sep 8, 2013

ShoulderDaemon
Oct 9, 2003
support goon fund
Taco Defender

Hard NOP Life posted:

#2 I don't personally believe in, I've yet to see a convincing argument.

It's important to explicitly list the columns you're selecting if you are using a SQL interface which just hands you back rows as simple arrays instead of as associative lists, and you consequently have to depend on the order that columns are returned in. Especially because some backup systems or replication systems may not preserve the natural order of columns.

That said, SQL interfaces which don't return the names of the columns are quite uncommon and very stupid and you'd be far better off just avoiding them entirely. Hell, even in the 90s it was hard to find a SQL client that forced you into such a dumb operating mode.

There's also the edge case where one of your columns may contain very large data, and you may not actually need it, in which case you should avoid selecting it because you're making the database do extra work to move that data around. But that's similarly uncommon and not very motivating for a general case.
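
For illustration, a quick sketch of the idea with a made-up attachments table where one column holds large blobs:

SQL code:
-- Naming only the columns you need keeps the big blob column out of the
-- result set entirely; SELECT * would drag it across the wire for every row
SELECT id, filename, uploaded_at
FROM attachments
WHERE uploaded_at >= '2013-09-01';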

Beelzebozo
Nov 6, 2002

I don’t mean to sound bitter, cold, or cruel. But I am, so that’s how it comes out.
The biggest argument against select * is that someone could change the table schema behind your code's back. If you create a function that expects a certain number or order of columns it could break unexpectedly.

Impotence
Nov 8, 2010
Lipstick Apathy
Not sure if this is the correct thread, but I'm looking for a database right now. It would consist of inserts (99%) and rarely any updates or deletes, with around 100 columns all storing an int type with anywhere from 0 to 15 characters. I need to be able to fit billions of rows in as little disk space as possible. Many of the columns would be zero, too.

Any ideas on what to do that preferably don't involve some kind of abomination of a separate deduped partition running or software suggestions? Postgres apparently compresses large data but I'm not too sure what large is and I'm reasonably sure that individual rows will never be past 1-2KB each uncompressed. No idea what MySQL does or what type (innodb? myisam?) to use.

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


ShoulderDaemon posted:

There's also the edge case where one of your columns may contain very large data, and you may not actually need it, in which case you should avoid selecting it because you're making the database do extra work to move that data around. But that's similarly uncommon and not very motivating for a general case.
Contrariwise, if you use SELECT * in an EXISTS clause, SQL Server is smart enough not to grab any data. That's the only business case I've found for it.
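
For illustration, roughly what that looks like (hypothetical Customers/Orders tables):

SQL code:
-- Inside EXISTS the select list is never materialized; the engine only
-- probes for a matching row, so SELECT * costs nothing extra here
SELECT c.CustomerId, c.Name
FROM Customers c
WHERE EXISTS (SELECT * FROM Orders o WHERE o.CustomerId = c.CustomerId);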

Janitor Prime
Jan 22, 2004

PC LOAD LETTER

What da fuck does that mean

Fun Shoe

Beelzebozo posted:

The biggest argument against select * is that someone could change the table schema behind your code's back. If you create a function that expects a certain number or order of columns it could break unexpectedly.

"If you follow this bad practice then it could break your code", thanks for proving my point about there not being a convincing argument.

ShoulderDaemon posted:

It's important to explicitly list the columns you're selecting if you are using a SQL interface which just hands you back rows as simple arrays instead of as associative lists, and you consequently have to depend on the order that columns are returned in. Especially because some backup systems or replication systems may not preserve the natural order of columns.

That said, SQL interfaces which don't return the names of the columns are quite uncommon and very stupid and you'd be far better off just avoiding them entirely. Hell, even in the 90s it was hard to find a SQL client that forced you into such a dumb operating mode.
I've never seen a client this dumb in any of the databases I've worked with, but thanks for sharing, you never know what archaic system you'll run into.

ShoulderDaemon posted:

There's also the edge case where one of your columns may contain very large data, and you may not actually need it, in which case you should avoid selecting it because you're making the database do extra work to move that data around. But that's similarly uncommon and not very motivating for a general case.

This isn't such an edge case, it needs to be considered each time you use SELECT * in your queries.

I'm not saying that you should use SELECT * everywhere because you're lazy, just that the qualifier NEVER in that statement is wrong. The best use case for it is when you're mapping objects to a table and each field has a corresponding column. I also wanted to say that if the underlying schema changes by say renaming a column without the corresponding code knowing about it, then it would still have broken even if you had listed out each column by name in the query.

Xae
Jan 19, 2005

Biowarfare posted:

Not sure if this is the correct thread, but I'm looking for a database right now. It would consist of inserts (99%) and rarely any updates or deletes, with around 100 columns all storing an int type with anywhere from 0 to 15 characters. I need to be able to fit billions of rows in as little disk space as possible. Many of the columns would be zero, too.

Any ideas on what to do that preferably don't involve some kind of abomination of a separate deduped partition running or software suggestions? Postgres apparently compresses large data but I'm not too sure what large is and I'm reasonably sure that individual rows will never be past 1-2KB each uncompressed. No idea what MySQL does or what type (innodb? myisam?) to use.

Are you looking to put 100 columns on one table?

RoadCrewWorker
Nov 19, 2007

camels aren't so great
PostgreSQL compression works on storing larger, variable sized columns (like binary or text strings) in "toast" tables that (by default) use compression. From my attempts to transfer databases from mysql to pg the former generally takes up a far smaller storage footprint, but i'm not certain how that applies to this special case.

Xae posted:

Are you looking to put 100 columns on one table?
~100 bigint columns with a low % of NZE, too (sparse Nx100 matrix?). Not the kind of dataset i usually handle in sql, i always assume there are far more optimized storage and processing formats for those.

RoadCrewWorker fucked around with this message at 06:18 on Sep 9, 2013

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!
So the UNIQUE constraint is implemented with an index, whereas other constraints aren't. Is there a bigger insight to be derived from this fact, or is this just an implementation detail? Can you leverage this fact beyond just knowing that now you get an index for free on that column?

Splinter
Jul 4, 2003
Cowabunga!
Is there a website out there for practicing select queries? I'm giving my friend a crash course on writing select queries and it'd be nice if there were a site out there that has some basic tables populated that could be queried through a web interface so he could get some practice. Does anything like this exist?

Also, anyone have any recommendations for either books or online courses (e.g. coursera, code academy) for learning basic SQL?

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!

Splinter posted:

Is there a website out there for practicing select queries? I'm giving my friend a crash course on writing select queries and it'd be nice if there were a site out there that has some basic tables populated that could be queried through a web interface so he could get some practice. Does anything like this exist?

Also, anyone have any recommendations for either books or online courses (e.g. coursera, code academy) for learning basic SQL?

There's a Stanford online course which is pretty close to your standard undergraduate cs course in databases, if I recall correctly.

Opulent Ceremony
Feb 22, 2012

Splinter posted:

Is there a website out there for practicing select queries

sqlzoo.net

Knyteguy
Jul 6, 2005

YES to love
NO to shirts


Toilet Rascal
Does anyone know what I can do to speed up this query? I'm 8 minutes in and have only managed to select about 33,000 records, when the total data set would be roughly 3 times that on just one of them:

SQL code:
INSERT INTO [dbo].[RelatedProduct] ([ProductId1],[ProductId2],[DisplayOrder])
		SELECT P.Id,
			(SELECT P2.Id
			 FROM Product P2
			 INNER JOIN MasterDB.dbo.ProductData PD2 ON P2.Sku = PD2.ItemId
			 WHERE PD2.ExtProductId = PS.similarproductid) AS RelatedProductID,
			 1 AS DisplayOrder
		FROM Product P
		INNER JOIN MasterDB.dbo.ProductData PD ON P.Sku = PD.ItemId
		INNER JOIN MasterDB.dbo.SPRStageProductSimilar PS ON PD.ExtProductId = PS.productid
		INNER JOIN MasterDB.dbo.ProductData PDx ON PS.similarproductid = PDx.ExtProductId 
		-- Make sure the Xref is valid
E: Nevermind, I added (my boss added) an index on Product SKU and it took about 5 seconds.

I have to run another one that is basically exactly the same (into another table).

Thanks.
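
For reference, the fix was presumably something along these lines (the index name and options are guesses):

SQL code:
CREATE NONCLUSTERED INDEX IX_Product_Sku ON Product (Sku);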

Knyteguy fucked around with this message at 23:17 on Sep 10, 2013

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!
You guys mentioned a while back that when just starting out you probably don't need to worry too much about Postgres maintenance. I'm curious what things are pretty important now that the database is slowly growing and likely to grow at a faster rate in the future. Things like autovacuum etc.

IAmKale
Jun 7, 2007

やらないか

Fun Shoe
I need some help optimizing the following query:

SQL code:
SELECT
kd2c.id as _id,
kd2c.literal as kanji
FROM
kd2_character as kd2c
JOIN krad_components as kcom ON kcom.kanji_fk = kd2c.id
WHERE kcom.radical_fk IN (1, 2, 3, etc...)
GROUP BY kd2c.id HAVING count(distinct kcom.radical_fk) = <number of integers in WHERE clause>
ORDER BY kd2c.freq IS NULL ASC, kd2c.freq, kd2c.id
The query itself (regardless of the number of fk's in the WHERE clause) takes 0.04 seconds to run, which is a long time relative to all of my other queries that take around 0.0003 seconds. I ran an EXPLAIN QUERY PLAN against the above statement and received the following:

code:
# | selectid | order | from | detail
1 | 0        | 0     | 0    | SCAN TABLE kd2_character AS kd2c USING INTEGER PRIMARY KEY (~1000000 rows)
2 | 0        | 1     | 1    | SEARCH TABLE krad_components AS kcom USING COVERING INDEX idx_krad_components (kanji_fk=? AND radical_fk=?) (~9 rows)
3 | 0        | 0     | 0    | EXECUTE LIST SUBQUERY 1
4 | 0        | 0     | 0    | USE TEMP B-TREE FOR ORDER BY
I'm pretty sure the script takes so long because of that initial SCAN TABLE. If that's the case, how can I get rid of it? I thought creating an index on kd2_character.id would help things along, but it didn't have any noticeable effect on execution time. Is there a better way to structure my GROUP BY, since it's probably the source of the SCAN?

a foolish pianist
May 6, 2007

(bi)cyclic mutation

I'm trying to aggregate some data from a forum, and what I'd like to get out of my query is something like this:

code:
user_id1, postcount-week1, postcount-week2, postcount-week3,..., postcount-weekn
user_id2, postcount-week1, postcount-week2, postcount-week3,..., postcount-weekn
...
for each user. The database just contains a list of posts, with the user_id, the date and time, and the post.

For individual weeks, it's simple enough:
code:
SELECT user_id, COUNT(post_id)
FROM post_table
WHERE WEEK(post_date) = $whateverweek
GROUP BY user_id
Is there a way to get the user_id and the counts for each week in a line?

McGlockenshire
Dec 16, 2005

GOLLOCKS!

a foolish pianist posted:

Is there a way to get the user_id and the counts for each week in a line?

Are you looking for a pivot table? You didn't specify which SQL server you're using, so check to see if it can do them natively.

It's a bit of a hack, but you can do it by hand... just not cleanly in one query. Create a temporary table with the correct columns, then run one query per week that inserts/updates the row accordingly. Or just return all of the rows sorted by user and then week and put it together yourself outside of the database.
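
A rough sketch of the by-hand version, assuming MySQL since the example query uses WEEK() (every name except post_table is made up):

SQL code:
-- One column per week of interest
CREATE TEMPORARY TABLE weekly_counts (
    user_id INT PRIMARY KEY,
    week1   INT NOT NULL DEFAULT 0,
    week2   INT NOT NULL DEFAULT 0
);

-- Run once per week, switching the target column and the week filter
INSERT INTO weekly_counts (user_id, week1)
SELECT user_id, COUNT(post_id)
FROM post_table
WHERE WEEK(post_date) = 1
GROUP BY user_id
ON DUPLICATE KEY UPDATE week1 = VALUES(week1);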

Malcolm XML
Aug 8, 2009

I always knew it would end like this.

Karthe posted:

I need some help optimizing the following query:

SQL code:
SELECT
kd2c.id as _id,
kd2c.literal as kanji
FROM
kd2_character as kd2c
JOIN krad_components as kcom ON kcom.kanji_fk = kd2c.id
WHERE kcom.radical_fk IN (1, 2, 3, etc...)
GROUP BY kd2c.id HAVING count(distinct kcom.radical_fk) = <number of integers in WHERE clause>
ORDER BY kd2c.freq IS NULL ASC, kd2c.freq, kd2c.id

The query itself (regardless of the number of fk's in the WHERE clause) takes 0.04 seconds to run, which is a long time relative to all of my other queries that take around 0.0003 seconds. I ran an EXPLAIN QUERY PLAN against the above statement and received the following:

code:
# | selectid | order | from | detail
1 | 0        | 0     | 0    | SCAN TABLE kd2_character AS kd2c USING INTEGER PRIMARY KEY (~1000000 rows)
2 | 0        | 1     | 1    | SEARCH TABLE krad_components AS kcom USING COVERING INDEX idx_krad_components (kanji_fk=? AND radical_fk=?) (~9 rows)
3 | 0        | 0     | 0    | EXECUTE LIST SUBQUERY 1
4 | 0        | 0     | 0    | USE TEMP B-TREE FOR ORDER BY

I'm pretty sure the script takes so long because of that initial SCAN TABLE. If that's the case, how can I get rid of it? I thought creating an index on kd2_character.id would help things along, but it didn't have any noticeable effect on execution time. Is there a better way to structure my GROUP BY, since it's probably the source of the SCAN?


Can you up the verbosity on the explain query plan? My gut tells me that something in the order by or having (likely the distinct) is messing with things, but nothing really seems to stick out. Also, you have a group by but one of your output cols is not in an aggregate function, so a stricter DBMS like PostgreSQL will bitch at you, since it's against the standard and the result is undefined.


Primary keys usually create an implicit index.

Fake edit : Suppose you dropped the having clause and removed the order by null asc bit : does it speed up? I want to say that the count distinct is expensive but that's just a hunch

You could try building the result backwards, by first grouping the components by kanji fk, filtering out those that don't have a full set of radical_fk matching the integers, then joining that to the character table to get the literal.

Or you could have a query that only works on one integer at a time (eliminating the having clause) and then intersecting to get what you want
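
A rough sketch of the inside-out version, reusing the names from the query above (untested, just to show the shape; the 3 in the HAVING has to match the number of integers in the IN list):

SQL code:
SELECT kd2c.id AS _id, kd2c.literal AS kanji
FROM (
    SELECT kanji_fk
    FROM krad_components
    WHERE radical_fk IN (1, 2, 3)
    GROUP BY kanji_fk
    HAVING COUNT(DISTINCT radical_fk) = 3
) AS matches
JOIN kd2_character AS kd2c ON kd2c.id = matches.kanji_fk
ORDER BY kd2c.freq IS NULL ASC, kd2c.freq, kd2c.id;
The one-integer-at-a-time variant would be the same inner query repeated once per radical_fk and chained together with INTERSECT.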

IAmKale
Jun 7, 2007

やらないか

Fun Shoe

Malcolm XML posted:

Can you up the verbosity on the explain query plan? My gut tells me that something in the order by or having (likely the distinct) is messing with things, but nothing really seems to stick out.
I ran ANALYZE on the database, then ran EXPLAIN QUERY PLAN again. Here's the updated output:

code:
# | selectid | order | from | detail
1 | 0        | 0     | 0    | SCAN TABLE kd2_character AS kd2c USING INTEGER PRIMARY KEY (~13108 rows)
2 | 0        | 1     | 1    | SEARCH TABLE krad_components AS kcom USING COVERING INDEX idx_krad_components (kanji_fk=? AND radical_fk=?) (~2 rows)
3 | 0        | 0     | 0    | EXECUTE LIST SUBQUERY 1
4 | 0        | 0     | 0    | USE TEMP B-TREE FOR ORDER BY

Malcolm XML posted:

Also, you have a group by but one of your output cols is not in an aggregate function, so a stricter DBMS like PostgreSQL will bitch at you, since it's against the standard and the result is undefined.
Can you elaborate more on this? I'm using SQLite and haven't experienced any kickback in instances like this, so I'm not sure what it means when an output column isn't part of an aggregate function within the query.

Malcolm XML posted:

Fake edit : Suppose you dropped the having clause and removed the order by null asc bit : does it speed up? I want to say that the count distinct is expensive but that's just a hunch
I played around with the query and found out that dropping HAVING in the GROUP BY shaved off about 0.015 seconds. Unfortunately, the HAVING is necessary or else the number of results returned increases whenever I add additional numbers to the WHERE...IN clause.

Malcolm XML posted:

You could try building the result backwards, by first grouping the components by kanji fk, filtering out those that don't have a full set of radical_fk matching the integers, then joining that to the character table to get the literal.
I see where you're going with this. I'll try working from krad_components first and then joining to kd2_character.

Edit: It's the ORDER BY that's causing the slow down! I removed the kd2c.freq IS NULL ASC, kd2c.freq part of the query and execution time dropped to about 0.0004 seconds, in line with my other queries. I created an index on kd2_character.freq and kd2_character.id thinking it would take care of the USE TEMP B-TREE FOR ORDER BY, but that didn't help at all.

IAmKale fucked around with this message at 15:28 on Sep 17, 2013

BabyFur Denny
Mar 18, 2003

a foolish pianist posted:

I'm trying to aggregate some data from a forum, and what I'd like to get out of my query is something like this:

code:
user_id1, postcount-week1, postcount-week2, postcount-week3,..., postcount-weekn
user_id2, postcount-week1, postcount-week2, postcount-week3,..., postcount-weekn
...
for each user. The database just contains a list of posts, with the user_id, the date and time, and the post.

For individual weeks, it's simple enough:
code:
SELECT user_id, COUNT(post_id)
FROM post_table
WHERE WEEK(post_date) = $whateverweek
GROUP BY user_id
Is there a way to get the user_id and the counts for each week in a line?
code:
SELECT
user_id, 
COUNT(YEARWEEK(post_date) = $week1 or null) AS week1_posts, 
COUNT(YEARWEEK(post_date) = $week2 or null) AS week2_posts, 
etc. etc.
COUNT(YEARWEEK(post_date) = $weekn or null) AS weekn_posts
FROM post_table
WHERE YEARWEEK(post_date) between $week1 and $weekn
GROUP BY user_id

BabyFur Denny fucked around with this message at 14:53 on Sep 17, 2013

Malcolm XML
Aug 8, 2009

I always knew it would end like this.

Karthe posted:

I ran ANALYZE on the database, then ran EXPLAIN QUERY PLAN again. Here's the updated output:

code:
# | selectid | order | from | detail
1 | 0        | 0     | 0    | SCAN TABLE kd2_character AS kd2c USING INTEGER PRIMARY KEY (~13108 rows)
2 | 0        | 1     | 1    | SEARCH TABLE krad_components AS kcom USING COVERING INDEX idx_krad_components (kanji_fk=? AND radical_fk=?) (~2 rows)
3 | 0        | 0     | 0    | EXECUTE LIST SUBQUERY 1
4 | 0        | 0     | 0    | USE TEMP B-TREE FOR ORDER BY

Can you elaborate more on this? I'm using SQLite and haven't experienced any kickback in instances like this, so I'm not sure what it means when an output column isn't part of an aggregate function within the query.

I played around with the query and found out that dropping HAVING in the GROUP BY shaved off about 0.015 seconds. Unfortunately, the HAVING is necessary or else the number of results returned increases whenever I add additional numbers to the WHERE...IN clause.

I see where you're going with this. I'll try working from krad_components first and then joining to kd2_character.

Edit: It's the ORDER BY that's causing the slow down! I removed the kd2c.freq IS NULL ASC, kd2c.freq part of the query and execution time dropped to about 0.0004 seconds, in line with my other queries. I created an index on kd2_character.freq and kd2_character.id thinking it would take care of the USE TEMP B-TREE FOR ORDER BY, but that didn't help at all.

Yeah, non-standard null handling will kill performance. Null sucks. You could try doing two queries, one with just nulls and one without, and merging them later.
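
Concretely, that split might look something like this (same query as above, run as two statements and concatenated in the application):

SQL code:
-- Pass 1: characters that have a frequency, ordered by it
SELECT kd2c.id AS _id, kd2c.literal AS kanji
FROM kd2_character AS kd2c
JOIN krad_components AS kcom ON kcom.kanji_fk = kd2c.id
WHERE kcom.radical_fk IN (1, 2, 3) AND kd2c.freq IS NOT NULL
GROUP BY kd2c.id HAVING COUNT(DISTINCT kcom.radical_fk) = 3
ORDER BY kd2c.freq, kd2c.id;

-- Pass 2: characters with no frequency, appended after the first result set
SELECT kd2c.id AS _id, kd2c.literal AS kanji
FROM kd2_character AS kd2c
JOIN krad_components AS kcom ON kcom.kanji_fk = kd2c.id
WHERE kcom.radical_fk IN (1, 2, 3) AND kd2c.freq IS NULL
GROUP BY kd2c.id HAVING COUNT(DISTINCT kcom.radical_fk) = 3
ORDER BY kd2c.id;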

PleasantDilemma
Dec 5, 2006

The Last Hope for Peace
Does this query do what I think it does?
code:
SELECT shipid,
       codecount
FROM   codestable
GROUP  BY shipid
ORDER  BY codeid DESC;
This is MySQL and codeid is the autoincrement field. I think this gives me the most recent codecount for each shipid. Is that correct?

And if it is correct, how can I use this to update a different table? I want to put the most recent codecount value into the shipstable

so my schema is like this:
code:
codestable:
codeid   shipid    codedate     codecount

shipstable:
shipid   shipname  codecount

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

PleasantDilemma posted:

Does this query do what I think it does?
code:
SELECT shipid,
       codecount
FROM   codestable
GROUP  BY shipid
ORDER  BY codeid DESC;
This is MySQL and codeid is the autoincrement field. I think this gives me the most recent codecount for each shipid. Is that correct?

No, that is not correct. You will get the codecount from an arbitrary row from amongst the rows matching each shipid value. In practice, it might happen that on certain machines, under certain conditions you find that you get the most recent value, but you cannot rely on this.

http://dev.mysql.com/doc/refman/5.0/en/group-by-extensions.html posted:

MySQL extends the use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause. ...However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group. The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate.

Use instead

code:
SELECT
    codestable.shipid,
    codestable.codecount
FROM
    codestable
    JOIN (
        SELECT
            shipid,
            MAX(codeid) AS MaxCodeId
        FROM
            codestable
        GROUP BY
            shipid
    ) AS MySubquery ON codestable.codeid = MySubquery.MaxCodeId
ORDER BY
    codestable.codeid DESC;
This can also be JOINed into an UPDATE query.
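
For instance, the UPDATE version might look roughly like this (untested sketch against the schema above, MySQL multi-table UPDATE syntax):

SQL code:
UPDATE shipstable
    JOIN (
        SELECT shipid, MAX(codeid) AS MaxCodeId
        FROM codestable
        GROUP BY shipid
    ) AS MySubquery ON MySubquery.shipid = shipstable.shipid
    JOIN codestable ON codestable.codeid = MySubquery.MaxCodeId
SET shipstable.codecount = codestable.codecount;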

IAmKale
Jun 7, 2007

やらないか

Fun Shoe
Is there a mechanism within SQL in general (SQLite specifically) that would prevent me from inserting a row of information that matches, column for column, an existing row? For example, in my app, users can tag individual words with several vocabulary tags at once. Right now, a user must tag words one at a time. I'm looking to implement a feature that would let them highlight multiple words at once and then tag them en masse.

I currently store the tag ID and word ID in a table like this:
SQL code:
id | wordID | tagID
1 | 1 | 1
2 | 1 | 2
3 | 2 | 3
I'm hoping I can set things up in the database so that, if a user selects Word 1 and Word 2 to simultaneously tag them with Tag 2, the SQLite database will ignore the INSERT command that is issued to insert Word 1 | Tag 2 since Word 1 was already tagged with Tag 2. Ultimately the app would be none the wiser if the INSERT doesn't actually commit anything to the database.

The alternative would be running a SELECT statement for each combination of Word and Tag that's selected and then only running an INSERT command if 0 rows are returned for a particular combination.

BabyFur Denny
Mar 18, 2003
Create a unique index (wordID, tagID) and do an INSERT IGNORE for any new tags. If the combination already exists in the database, it is silently dropped.
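
Roughly (the word_tags table name is a guess; note that INSERT IGNORE is the MySQL spelling, and SQLite's equivalent is INSERT OR IGNORE):

SQL code:
CREATE UNIQUE INDEX idx_word_tag ON word_tags (wordID, tagID);

-- Silently skipped if this (wordID, tagID) pair already exists
INSERT OR IGNORE INTO word_tags (wordID, tagID) VALUES (1, 2);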

FlyWhiteBoy
Jul 13, 2004
Can someone recommend a book about SQL Security? I haven't found any clear favorites on Amazon or Google.

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

FlyWhiteBoy posted:

Can someone recommend a book about SQL Security? I haven't found any clear favorites on Amazon or Google.

MySQL? MSSQL? The problem with SQL security is that:
- The majority of the problems are from the underlying presentation language (php, .net, etc.) each of which have their own quirks
- The SQL platforms by themselves are actually pretty secure, except when
- You write bad code and queries that allow them to be exploited (see point 1)

So a generic SQL Security book would be pretty hard to write, other than a couple pages of "how not to write injectable code". However, there are things specific to each platform, such as the permission escalation bug in older versions of MSSQL, that could stand to be documented specifically. But again, it would depend largely on what flavour of SQL, and what presentation language you are using.

FlyWhiteBoy
Jul 13, 2004

Scaramouche posted:

MySQL? MSSQL? The problem with SQL security is that:
- The majority of the problems are from the underlying presentation language (php, .net, etc.) each of which have their own quirks
- The SQL platforms by themselves are actually pretty secure, except when
- You write bad code and queries that allow them to be exploited (see point 1)

So a generic SQL Security book would be pretty hard to write, other than a couple pages of "how not to write injectable code". However, there are things specific to each platform, such as the permission escalation bug in older versions of MSSQL, that could stand to be documented specifically. But again, it would depend largely on what flavour of SQL, and what presentation language you are using.

I'm sorry I forgot to be more specific. I'm looking for a book that focuses on MS SQL Server security and auditing.

IAmKale
Jun 7, 2007

やらないか

Fun Shoe

BabyFur Denny posted:

Create a unique index (wordID, tagID) and do an INSERT IGNORE for any new tags. If the combination already exists in the database, it is silently dropped.
That worked perfectly, thanks :)

foundtomorrow
Feb 10, 2007
Ok guys, Google has failed me on this question. This is for a side project.
I need some help on automating a process with maybe Triggers or Stored Procedures that involves importing the contents of a .csv every two minutes (with likely duplicates from one .csv to the next) into a SQL Server database (2008 or 2012 is fine).

I am scraping a website which has data on the current status of multiple websites (latency, down, slow, etc.).
I have the scraping part all set up with Python using Selenium and BeautifulSoup and RegEx. This Python script exports to a Pipe-Delimited Text file once every 2 minutes. There's less than 100 rows on each 2 minute export, so we aren't talking huge data here, at least.
I can easily import my text file into my SQL Server Database table like this:
code:
BULK
INSERT CSVTest
FROM 'c:\testing\output1.txt'
WITH
(
FIELDTERMINATOR = '|',
ROWTERMINATOR = '\n'
)
GO
That works great. I also need to add a field which auto-populates on the current time when the row is inserted (I googled this, and I can do this as well).
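
For what it's worth, that timestamp column is typically just a default (the column name here is a guess; with BULK INSERT you may then need a format file or a view so the file's fields still line up):

SQL code:
ALTER TABLE CSVTest ADD ImportedAt datetime NOT NULL DEFAULT GETDATE();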

This is the complicated part:
For each 2 minute update, there are both duplicate records which were imported previously, and also new records which need to be added.
One record can possibly be in every imported file for an entire hour (30 different records that were imported). This is fine, and I need this to happen, because I need to be able to calculate how long that record was actually showing on the website. In other words, if Record A shows up at 2PM, and then goes off the website at 3PM, I will have 30 entries in my Database table for that entry (one entry every 2 minutes). I need to be able to calculate that 1 hour time difference between the first entry and the last entry and then add that value to the first entry in a new field (Elapsed Time). There is another small complication. That exact same record could show up 30 minutes later, so I need to make sure my script knows that if the value was in the previous CSV import, but not in the current import, then that's when the Elapsed Time calculation should occur.

Sorry for the wall of text, but SQL Server Triggers or Stored Procedures are not familiar to me. I am comfortable with Python, Java, C#, but haven't done much SQL Server development.

I'm not looking for code to copy and paste, I just need some guidance on the workflow I should use and how I can detect when the imported record was in the previous update and not in the current update, so I know when to calculate the Elapsed Time.

foundtomorrow fucked around with this message at 02:14 on Sep 27, 2013

aBagorn
Aug 26, 2004
I have a where statement in a stored proc I'm writing (MS-SQL) that includes this:

SQL code:
dbA.dateCreated BETWEEN @pStartDate and @pEndDate
However, StartDate and EndDate could come in as null. I've never been great with handling null parameters in sql. What's the best way to go about?

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!
In general, your best bet is going to be something like this:
1) Load the data into a staging table (this is what you're doing now)
2) Perform any necessary calculations or transformations, maybe along with putting things into "work" tables.
3) Store the results in a final table which is what you do your reporting and whatnot against.

A few questions:
How often do you need to see the results? You scrape in real-time (more or less), but do you also need the data in SQL to be updated in real-time, or does that only need to happen every few hours or once a day or whatever?
Is there any unique identifier either on the file itself or on the records in the file? In other words, is there a way to distinguish between loading the same file again versus a new file that just has all the same entries in it?

My general suggestion would be to have your staging table and your reporting table. Load the CSV into the staging table. Then 1) create new "active" records in the reporting table for every record in the staging table that does not have a corresponding "active" record in the reporting table. 2) update all "active" reporting records that have corresponding records in the staging table. 3) update all "active" reporting records to "inactive" if there is no corresponding staging record. You may also want to copy the records in the staging table to some sort of history or archive table, making sure to indicate (probably with a FK to a header table) which records were loaded from what file and when.
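
A very rough T-SQL sketch of those three steps (all table and column names here are invented, just to show the shape):

SQL code:
-- 1) New "active" reporting rows for staging rows with no active match
INSERT INTO Reporting (SiteKey, Status, FirstSeen, IsActive)
SELECT s.SiteKey, s.Status, GETDATE(), 1
FROM Staging s
WHERE NOT EXISTS (SELECT * FROM Reporting r
                  WHERE r.SiteKey = s.SiteKey AND r.IsActive = 1);

-- 2) Refresh active rows that are still present in this load
UPDATE r
SET r.Status = s.Status, r.LastSeen = GETDATE()
FROM Reporting r
JOIN Staging s ON s.SiteKey = r.SiteKey
WHERE r.IsActive = 1;

-- 3) Close out active rows that vanished from this load; the elapsed time is
--    then just ClosedAt minus FirstSeen on the closed row
UPDATE r
SET r.IsActive = 0, r.ClosedAt = GETDATE()
FROM Reporting r
WHERE r.IsActive = 1
  AND NOT EXISTS (SELECT * FROM Staging s WHERE s.SiteKey = r.SiteKey);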

Triggers have their uses, but this probably isn't one of them.

If you haven't already found it, here's the T-SQL Reference for SQL 2008.

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

aBagorn posted:

I have a where statement in a stored proc I'm writing (MS-SQL) that includes this:

SQL code:
dbA.dateCreated BETWEEN @pStartDate and @pEndDate
However, StartDate and EndDate could come in as null. I've never been great with handling null parameters in sql. What's the best way to go about?

SQL code:
dbA.dateCreated BETWEEN ISNULL(@pStartDate,dbA.dateCreated) and ISNULL(@pEndDate,dbA.dateCreated)
This assumes that it's OK if either or both are null, and that null means wide open.

aBagorn
Aug 26, 2004

Jethro posted:

SQL code:
dbA.dateCreated BETWEEN ISNULL(@pStartDate,dbA.dateCreated) and ISNULL(@pEndDate,dbA.dateCreated)
This assumes that it's OK if either or both are null, and that null means wide open.

Exactly what I was looking for!

I tried to shoehorn in a case statement and I wasn't grasping it syntactically

Salt Fish
Sep 11, 2003

Cybernetic Crumb
Is there any reason to use Elasticsearch over Sphinx if I want to improve fulltext searching on a mysql cluster? Seems like ES would require more work to integrate although some of the analytics could be nice to have. Thoughts?

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!
I need some advice on how to better secure a DB that for now I still want direct Internet access to. I understand that in the real world you never put your db on an internet-facing machine and expect not to get hacked, but we're pretty scrappy and have to deal with this for now unless there's a very quick way of setting up rerouting someone's pgAdmin connection through a gateway box. Or even better expose the data through some kind of admin dashboard instead of connecting directly etc etc.

Right now we're using passwords and IP whitelisting, but the problem is that whitelisting sucks when you have to directly connect to the DB from a different location (which we do a lot), and now you have to ssh into that machine and update its pg_hba.conf and make sure the IP is recorded in our configuration management repo. What's a step up from this? Is there a simpler way of still ensuring a bare minimum of security without whitelisting?

Salt Fish
Sep 11, 2003

Cybernetic Crumb

DreadCthulhu posted:

I need some advice on how to better secure a DB that for now I still want direct Internet access to. I understand that in the real world you never put your db on an internet-facing machine and expect not to get hacked, but we're pretty scrappy and have to deal with this for now unless there's a very quick way of setting up rerouting someone's pgAdmin connection through a gateway box. Or even better expose the data through some kind of admin dashboard instead of connecting directly etc etc.

Right now we're using passwords and IP whitelisting, but the problem is that whitelisting sucks when you have to directly connect to the DB from a different location (which we do a lot), and now you have to ssh into that machine and update its pg_hba.conf and make sure the IP is recorded in our configuration management repo. What's a step up from this? Is there a simpler way of still ensuring a bare minimum of security without whitelisting?

I would only allow access to the MySQL service on the local area network and then have my users VPN to the private network to connect to the service. Do you have a capable hardware firewall?

foundtomorrow
Feb 10, 2007

Jethro posted:

Great advice

A few questions:
How often do you need to see the results? You scrape in real-time (more or less), but do you also need the data in SQL to be updated in real-time, or does that only need to happen every few hours or once a day or whatever?
Is there any unique identifier either on the file itself or on the records in the file? In other words, is there a way to distinguish between loading the same file again versus a new file that just has all the same entries in it?

Great advice.

Thanks for the excellent help man.

I don't need to see results in SQL Server often at all, once a week would be fine. The SQL table only needs to be updated once a day, I'd say.

Currently, there is no unique ID on the CSV that I create (no headers, just my scraped data: about 10 fields, 100 records max per CSV, every 2 minutes).
There would be no way to distinguish between two different CSVs. The way I have the Python set up right now, it actually deletes the old CSV and replaces it with new data every 2 minutes (assuming I am loading the CSV into SQL Server immediately).

The real kicker is that I kind of do need to load the CSV quickly into SQL Server in order to create that extra field which adds on exactly what time the record was added in SQL Server. I could always do that in Python though, if I needed to.


I will look into the option of a staging table and a final table, check the final table for any records already existing and add only the new records, and then calculate a time difference between the first record insertion and the repeat records. I like that idea.

DreadCthulhu
Sep 17, 2008

What the fuck is up, Denny's?!

Salt Fish posted:

I would only allow access to the MySQL service on the local area network and then have my users VPN to the private network to connect to the service. Do you have a capable hardware firewall?

Our "datacenter" is in AWS EC2 right now, so I'm using whatever firewall they provide at the network interface level when you generate instances. Is the VPN approach still sensible for AWS?

Xae
Jan 19, 2005

foundtomorrow posted:

Ok guys, Google has failed me on this question. This is for a side project.
I need some help on automating a process with maybe Triggers or Stored Procedures that involves importing the contents of a .csv every two minutes (with likely duplicates from one .csv to the next) into a SQL Server database (2008 or 2012 is fine).

I am scraping a website which has data on the current status of multiple websites (latency, down, slow, etc.).
I have the scraping part all set up with Python using Selenium and BeautifulSoup and RegEx. This Python script exports to a Pipe-Delimited Text file once every 2 minutes. There's less than 100 rows on each 2 minute export, so we aren't talking huge data here, at least.
I can easily import my text file into my SQL Server Database table like this:
code:
BULK
INSERT CSVTest
FROM 'c:\testing\output1.txt'
WITH
(
FIELDTERMINATOR = '|',
ROWTERMINATOR = '\n'
)
GO
That works great. I also need to add a field which auto-populates on the current time when the row is inserted (I googled this, and I can do this as well).

This is the complicated part:
For each 2 minute update, there are both duplicate records which were imported previously, and also new records which need to be added.
One record can possibly be in every imported file for an entire hour (30 different records that were imported). This is fine, and I need this to happen, because I need to be able to calculate how long that record was actually showing on the website. In other words, if Record A shows up at 2PM, and then goes off the website at 3PM, I will have 30 entries in my Database table for that entry (one entry every 2 minutes). I need to be able to calculate that 1 hour time difference between the first entry and the last entry and then add that value to the first entry in a new field (Elapsed Time). There is another small complication. That exact same record could show up 30 minutes later, so I need to make sure my script knows that if the value was in the previous CSV import, but not in the current import, then that's when the Elapsed Time calculation should occur.

Sorry for the wall of text, but SQL Server Triggers or Stored Procedures are not familiar to me. I am comfortable with Python, Java, C#, but haven't done much SQL Server development.

I'm not looking for code to copy and paste, I just need some guidance on the workflow I should use and how I can detect when the imported record was in the previous update and not in the current update, so I know when to calculate the Elapsed Time.

Do the items have a unique ID?

If So you could expand the schema to handle the slowly changing dimension.

code:
Create table Item_Master
(
Item_ID   int,
Item_Name nvarchar(200)
)

Create table Item_Availability
(
Item_ID             int,
Availability_Status nvarchar(1), --Y/N flag
Status_ts           datetime default getdate()
)
You then load the CSV to a temporary/staging table and process the records.
1) Check Item_Master if Item_ID exists, check for Item_Name update. If Item_ID does not exist add to Item_Master table
2) Insert Availability Status to Item_Availability
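
A minimal sketch of those two steps (the staging table name is a guess, and the Item_Name update check is left out):

SQL code:
-- 1) Add items we've never seen before
INSERT INTO Item_Master (Item_ID, Item_Name)
SELECT s.Item_ID, s.Item_Name
FROM Staging_Import s
WHERE NOT EXISTS (SELECT * FROM Item_Master m WHERE m.Item_ID = s.Item_ID);

-- 2) Record this load's availability flag; Status_ts fills in via its default
INSERT INTO Item_Availability (Item_ID, Availability_Status)
SELECT s.Item_ID, s.Availability_Status
FROM Staging_Import s;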
