Ranzear
Jul 25, 2013

It's definitely threads and buffer size due to awful defaults. Index stuff would never be that bad. If you have access or can ask for the config I can better tell you what's up.

Ytlaya
Nov 13, 2005

Ranzear posted:

It's definitely threads and buffer size due to awful defaults. Index stuff would never be that bad. If you have access or can ask for the config I can better tell you what's up.

I think this is the contents of the relevant config file (if there's a way to show this much text in a scrollable window or something to avoid taking up so much space, someone let me know):

code:
[mysqld]

# innodb_empty_free_list_algorithm=backoff
innodb_buffer_pool_size=16G
# innodb_ibuf_max_size=2G
innodb_ft_min_token_size=3
# innodb_use_sys_malloc=0
innodb_file_per_table=ON
key_buffer_size=10M

ft_min_word_len = 3

# main = 1 for active master: server A
gtid-domain-id=1

tmpdir=/export/tmp
wait_timeout=180
lc_time_names=en_US
lc_messages=en_US
max_connections=2048
thread_cache_size=16
open_files_limit = 16384

query_cache_type = 1
query_cache_min_res_unit = 1k
query_cache_limit = 5120M
query_cache_size=2000M

log_error=/var/log/mysql/error.log

#general_log=1
#general_log_file=/var/log/mysql/server.log
skip-name-resolve

# Skip recovery for now:
innodb_force_recovery=1

bind-address = 0.0.0.0
port = 3306

slow_query_log=1
slow_query_log_file=/var/log/mysql/mysql-slow.log
long_query_time=60.0
log_queries_not_using_indexes=0
log_warnings=4
log_slow_admin_statements=ON

ft_min_word_len=3

log-bin=/var/lib/mysql/gn0-binary-log
expire-logs-days=120
server-id=1
# Domain = 1 for active master: server A
gtid-domain-id=1

[myisamchk]
sort_buffer_size=4M
ft_min_word_len = 3

Ranzear
Jul 25, 2013

Query cache is way too big and configured backwards. query_cache_size should be 256m tops, and only then with extremely large but static data. I don't think I ever set it past 128m on absolute monster nodes. query_cache_limit is completely reversed too: that's the max size of an individual query result to be stored in the cache, so it should just be reasonably larger than some result set you'd expect to reuse (not even your largest). 1m is fine unless y'all need a smidge more, but it can be mere kilobytes for real benefit.

Don't quote me on the actual mechanisms going on, but you can imagine that writing, and then sooner or later invalidating, 5 gigs at a time on every statement might be a little off, yeah? Surprised it hasn't complained about the sizes being irreconcilable.

Should A/B test with reasonable values, but I'd just turn it off entirely until you're sure there's no indexing to fix too.

That it hasn't given a startup error of outright refusal gives me a sneaking suspicion this isn't the only config file in play. Verify everything with SHOW VARIABLES, especially the innodb values which otherwise look fine.
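
To make that concrete, here's roughly where I'd start, as a sketch (the numbers are the ballpark figures above, not values tuned for your workload):
code:
# saner query cache settings, if you keep it on at all
query_cache_type  = 1
query_cache_size  = 128M
query_cache_limit = 1M
# or just switch it off while you rule out missing indexes:
# query_cache_type = 0
# query_cache_size = 0
And to see what the running server actually loaded, whichever config file it came from:
SQL code:
SHOW VARIABLES LIKE 'query_cache%';
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';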

Ranzear fucked around with this message at 11:27 on Apr 23, 2023

kiwid
Sep 30, 2013

I feel like this might be a really stupid question but when I see queries like this:

SQL code:
SELECT
    s1_name_and_address.full_name,
    0.000000 AS ingr,
    s1_uom_conv_10000_a.conv_factor / 10000.0e0 AS order_formula_conv_factor,
    s1_uom_conv_10000_b.conv_factor / 10000.0e0 AS formula_mixer_conv_factor,
    s1_uom_a.measurement_type_code,
FROM ...
What does the 0e0 mean in the "s1_uom_conv_10000_a.conv_factor / 10000.0e0" part?

Bruegels Fuckbooks
Sep 14, 2004

Now, listen - I know the two of you are very different from each other in a lot of ways, but you have to understand that as far as Grandpa's concerned, you're both pieces of shit! Yeah. I can prove it mathematically.

kiwid posted:

What does the 0e0 mean in the "s1_uom_conv_10000_a.conv_factor / 10000.0e0" part?

The number is in scientific notation (the e means exponent), e.g. 10e0 is 10 and 10e+1 is 100. Adding the e0 to the end of that literal makes the number get treated as a float instead of a decimal, and decimal and float behave differently in SQL division.
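
A quick way to see the difference (a sketch; the exact digits depend on the engine's precision rules, this is what SQL Server would give you):

SQL code:
-- decimal division: the result is a decimal whose scale comes from the operands
SELECT 1 / 3.0   AS dec_div;    -- 0.333333
-- float division: the e0 suffix makes the literal a float, so the whole expression is float
SELECT 1 / 3.0e0 AS float_div;  -- 0.333333333333333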

kiwid
Sep 30, 2013

Bruegels Fuckbooks posted:

The number is in scientific notation (the e means exponent), e.g. 10e0 is 10 and 10e+1 is 100. Adding the e0 to the end of that literal makes the number get treated as a float instead of a decimal, and decimal and float behave differently in SQL division.

Great answer, thank you.

Sauer
Sep 13, 2005

Socialize Everything!
I'm not a habitual database toucher; I just happen to be the company nerd ("you program CNC machines, I'm sure you can do SQL; whatever that is" - the management) and I'm working on an inventory report for the purchasing folks. The data is coming from a PostgreSQL database (some ancient 9.3 version).

I'm extracting our on-hand inventory of various widgets which are spread out across various warehouses. Getting the quantity of each doohicky in the relevant warehouses is trivial, but they'd prefer to see it in a columnar-like form instead of row-like, and I've hit a snag trying to make that happen.

This is the query currently:
SQL code:
-- Find available part inventory by warehouse
select
    part_quantity.prt_id,
    part_quantity.whs_no,
    part_quantity.pqt_real_qty

from
    part_quantity

where
        part_quantity.whs_no in ('AAA', 'BBB', 'CCC', 'DDD')	-- Relevant warehouses
    and part_quantity.prt_id = 7313				-- Limit to one part for testing
Which gets us:
pre:
prt_id		whs_no		pqt_real_qty
7313		AAA		0.0
7313		BBB		0.0
7313		CCC		0.0
7313		DDD		401.0
They'd prefer to see it as:
pre:
prt_id		AAA	BBB	CCC	DDD
7313		0.0	0.0	0.0	401.0
I've already discovered the crosstab() function in the tablefunc extension, which I imagine is designed to do exactly this, but I honestly can't make heads or tails of its documentation, and I wouldn't be surprised if there's a simpler way I'm just not aware of. Any hints other than telling my boss to send me to SQL school if he wants this sort of stuff? I've already suggested training but...

nielsm
Jun 1, 2009



Untested, but I think it should look something like this:

SQL code:
SELECT *
FROM crosstab(
    'SELECT prt_id, whs_no, pqt_real_qty FROM part_quantity',
    'SELECT DISTINCT whs_no FROM part_quantity'
)
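-- note: this form of crosstab() returns SETOF record, so the call also needs a column
-- definition list, e.g. AS ct (prt_id bigint, aaa numeric, bbb numeric, ccc numeric, ddd numeric)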

Sauer
Sep 13, 2005

Socialize Everything!
I took a break on this project, machined some parts, and came back to reread the documentation for crosstab() and worked it out. The first parameter of the function retrieves the data set and the second determines the categorization. The resulting query is kind of ugly, but it does work:

SQL code:
select
    *

from crosstab(
$$
    select
        part_quantity.prt_id,
        part_quantity.whs_no,
        sum(part_quantity.pqt_real_qty)

    from
        part_quantity

    where
        part_quantity.whs_no in ('AAA', 'BBB', 'CCC', 'DDD')
    
    group by
        prt_id,
        whs_no
    
    order by
        prt_id,
        whs_no
$$,

$$ 
    select distinct 
        whs_no 
    
    from 
        part_quantity 
    
    where
        part_quantity.whs_no in ('AAA', 'BBB', 'CCC', 'DDD')
    
    order by
        whs_no
$$
) as ct (prt_id bigint, aaa numeric, bbb numeric, ccc numeric, ddd numeric)
Thanks for the hint, nielsm.

raminasi
Jan 25, 2005

a last drink with no ice

CKinbote posted:

I'm struggling at work with a (third-party) application built around a database; I only ever interact with databases through an ORM, so I'd love a more informed opinion as to whether this thing is as dumb as it appears.

This application ingests and stores data from experiments run on various samples. Each sample has a unique ID. Each experiment has its own table in the database, with fields for the sample ID and various things measured in the experiment. The end user wants all the data in one big table, so a batch job joins all the experiment tables into one big display table.

Currently the batch job takes an hour to run. There are about 20 experiment tables, with about 500 total measurement fields. There have been about 60,000 samples. There are a total of about 2 million measured values, so about 14 out of every 15 values in the final table are empty.

I have a hard time believing this is state of the art, but I don't know much about databases and am curious whether there is a deep reason for this design that I don't know about. I feel like storing the data in a "skinny" format with fields for just sample ID, field ID, and value (and then have other tables with details for each field, like name and originating experiment) could be the way to go, but I wonder if that would fail to scale for some reason I don't know about.

I don't know much about how this database is implemented (other than "in Java") or how the AWS instance that runs all this is provisioned, unfortunately.

I agree with previous responses that if the end user wants it this way, figure out why they do before you advocate for any changes, but I'll add that the "sparse table" style you're describing isn't a novel thing this project invented. It's one of a number of known solutions to this kind of problem. This isn't to say it's necessarily a good solution in this or any particular case, but it's not totally out of left field.

Lurkmaster 5000
Mar 11, 2019
I am writing some code that will return a date based on whether they are in a certain group or not. The first CASE statement does the job, but the second one only works for the first value in each IN statement. What's going on?

CASE
WHEN pts2.agegroup='2_19' or pts2.agegroup='5-19' or pts2.agegroup='12_19' then '03-01-2019'
WHEN pts2.agegroup='2_20' or pts2.agegroup='5-20' or pts2.agegroup='12_20' then '03-01-2020'
WHEN pts2.agegroup='2_21' or pts2.agegroup='5-21' or pts2.agegroup='12_21' then '03-01-2021'
WHEN pts2.agegroup='2_22' or pts2.agegroup='5-22' or pts2.agegroup='12_22' then '03-01-2022'
WHEN pts2.agegroup='2_23' or pts2.agegroup='5-23' or pts2.agegroup='12_23' then '03-01-2023'
END

CASE
WHEN pts2.agegroup IN ('2_19' ,'5-19' ,'12_19') then '03-01-2019'
WHEN pts2.agegroup IN ('2_20' ,'5-20' ,'12_20') then '03-01-2020'
WHEN pts2.agegroup IN ('2_21' ,'5-21' ,'12_21') then '03-01-2021'
WHEN pts2.agegroup IN ('2_22' ,'5-22' ,'12_22') then '03-01-2022'
WHEN pts2.agegroup IN ('2_23' ,'5-23' ,'12_23') then '03-01-2023'
END

Moonwolf
Jun 29, 2004

Flee from th' terrifyin' evil of "NHS"!


Lurkmaster 5000 posted:

I am writing some code that will return a date based on whether they are in a certain group or not. The first CASE statement does the job, but the second one only works for the first value in each IN statement. What's going on?

CASE
WHEN pts2.agegroup='2_19' or pts2.agegroup='5-19' or pts2.agegroup='12_19' then '03-01-2019'
WHEN pts2.agegroup='2_20' or pts2.agegroup='5-20' or pts2.agegroup='12_20' then '03-01-2020'
WHEN pts2.agegroup='2_21' or pts2.agegroup='5-21' or pts2.agegroup='12_21' then '03-01-2021'
WHEN pts2.agegroup='2_22' or pts2.agegroup='5-22' or pts2.agegroup='12_22' then '03-01-2022'
WHEN pts2.agegroup='2_23' or pts2.agegroup='5-23' or pts2.agegroup='12_23' then '03-01-2023'
END

CASE
WHEN pts2.agegroup IN ('2_19' ,'5-19' ,'12_19') then '03-01-2019'
WHEN pts2.agegroup IN ('2_20' ,'5-20' ,'12_20') then '03-01-2020'
WHEN pts2.agegroup IN ('2_21' ,'5-21' ,'12_21') then '03-01-2021'
WHEN pts2.agegroup IN ('2_22' ,'5-22' ,'12_22') then '03-01-2022'
WHEN pts2.agegroup IN ('2_23' ,'5-23' ,'12_23') then '03-01-2023'
END

Those should be the same. Which SQL engine is this? I've not heard of any having fucky CASE statements, though.

abelwingnut
Dec 23, 2002


at least in sql server, i was always told the engine turns an IN statement into the same logic/pathway it uses for multiple OR. so the engine's turning the second one into the first one.

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
Not really a SQL question, but I figure y'all have the lay of the land for analytics tools in here. Cross posting:

Twerk from Home posted:

I'm wanting a single-node simple text search for exploring datasets on the order of a couple terabytes of text that's medium-compressible. I want to KISS because this fits on a single computer, but we still want chunking so that it's reasonably fast and efficient and uses several cores. Dataset sizes are around 1-5TB uncompressed and 200-300GB compressed.

We've got lots of these datasets and we don't need HA uptime and have no streaming ingest. What tools should I look at before gluing something together myself? Single node elasticsearch? DuckDB? Shell scripts calling split, zstd, ripgrep, cat and a dozen other unix tools?

I would blow Dane Cook
Dec 26, 2008
https://www.youtube.com/watch?v=k4RRi_ntQc8

nem
Jan 4, 2003

panel.dev
apnscp: cPanel evolved
On the Postgres time-series side with TimescaleDB, has there been much luck on reducing record size by cleaving the TIMESTAMP field to an INT while retaining native functions?

Just-In-Timeberlake
Aug 18, 2003
what say you thread

lazerwolf
Dec 22, 2009

Orange and Black

Just-In-Timeberlake posted:

what say you thread



Can SQL center a div?

LightRailTycoon
Mar 24, 2017

lazerwolf posted:

Can SQL center a div?
The dialect I'm stuck with has a
REMOVE JAVA Statement

I would blow Dane Cook
Dec 26, 2008
SELECT * FROM my
where passion = 'sql'

Bruegels Fuckbooks
Sep 14, 2004

Now, listen - I know the two of you are very different from each other in a lot of ways, but you have to understand that as far as Grandpa's concerned, you're both pieces of shit! Yeah. I can prove it mathematically.

I would blow Dane Cook posted:

SELECT * FROM my
where passion = 'sql'

ORA-01403 Error Message: “No Data Found”

abelwingnut
Dec 23, 2002


so i've got a noodlescratcher here.

i am trying to build a dataset from our salesforce db detailing how our companies and their attributes and measures have changed monthly over the last two years. therefore, each company would have 24 rows, each representing a month. it's obviously easy to get the attributes and measures for our current month--i just pull the data from the accounts and users table in their current state. however, trying to get the previous 23 months for each company is not obvious to me right now.

what i have to work with are an accountshistory table and a contractshistory table. they're effectively log files. in each one, the structure is basically:

--

id, accountid, dateofchange, fieldchanged, oldvalue, newvalue

--

there's some other stuff in there too, but nothing that seems like it'd help get what i want. but yea, each row in each of these tables represents a single change to whatever account, then tells you exactly what field changed and when.

so, i have the data, i'm just not sure how to organize it, i suppose? like, how do i take these log files and produce something that'll match up the monthly structure of the dataset i want to create?

like, for instance, let's say account a's city was changed in april of 2022. it was new york before, but was changed to dc. i can see how i can match the data up so that, in my target dataset, i have the city field for account a's april 2022 row with the city = 'dc'. but how would i then fill that for may 2022? june 2022? and so forth until whenever the next change occurred or we arrive at the current day without a change?

i guess that's my first big question. i think i have others, but maybe whatever answer we come up with will help with those. maybe there's some clever over (partition by … ) logic to make this work in the target dataset? or do i need to create a temp table or cte or something before i even get to the target which effectively rearranges the data to:

...
february | 2022 | city | new york
march | 2022 | city | new york
april | 2022 | city | dc
may | 2022 | city | dc
...

and just have that happen for all fields? how would i even start with something like that?

i don't know--i'm kind of at a loss here. and some basic googling hasn't really given me much direction. probably not searching for the right terms, but it's also hard to phrase for a search.

any help would be greatly appreciated!

abelwingnut fucked around with this message at 06:12 on Jun 19, 2023

nielsm
Jun 1, 2009



For that kind of log, your best bet is probably to iteratively replay it. Order the log entries by time and "execute" them one by one. It probably is possible to express with an SQL query in some way, but it's going to be difficult to write and read, and likely not be any faster either.

If you have the option to do this in a different language (like C# or Python or whatever) that might be less painful.

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

abelwingnut posted:

so i've got a noodlescratcher here.

That schema kinda sucks but you can transform it to what you need with some effort. If a report asks for one row per month regardless of whether there is any data for that month or not, the correct answer is usually a cross join on generate_series(start_date, end_date, '1 months'::interval) and if you're not on postgres you get to lament your lack of generate_series() and write some ridiculous recursive CTE instead.

Here's kind of a sketch of something kinda like what I think you want, almost certainly not the easiest or most efficient way to do it but it was fun to write:

SQL code:
with accounthistory_ranges as (
  select
    account_id,
    field_changed,
    changed_at as valid_from,
    lead(changed_at, 1, 'infinity') over (
      partition by account_id, field_changed
      order by accounthistory.changed_at
    ) as valid_to,
    new_value as val
  from accounthistory
)
select
  months.first_day,
  most_recent_value_per_month.*
from
  generate_series(
    '2022-10-01T00:00:00.000Z'::timestamptz,
    date_trunc('month', now()),
    '1 months'::interval
  ) as months (first_day)
  cross join lateral (
    select *
    from accounthistory_ranges
    where
      (accounthistory_ranges.valid_from, accounthistory_ranges.valid_to) overlaps
        (months.first_day, months.first_day + '1 months'::interval)
    order by accounthistory_ranges.valid_to desc
    limit 1
  ) as most_recent_value_per_month;
This is very much not complete; it works for the toy example but I don't think it's a complete solution. As mentioned it's a sketch that you may be able to build from; here's a fiddle if you wanna mess around.

(I assumed there can be more than one change per field per month but that you only want the most recent one.)

TheFluff fucked around with this message at 15:52 on Jun 19, 2023

Cold on a Cob
Feb 6, 2006

i've seen so much, i'm going blind
and i'm brain dead virtually

College Slice
If you're writing t-sql you can generate a series using string_split() if you are running SQL Server 2016 or later, or generate_series() if you are running SQL Server 2022.
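
A sketch of the 2022 flavour, for the curious (GENERATE_SERIES only takes numbers, so you still DATEADD your way to month starts):

SQL code:
-- 24 month starts beginning June 2021, SQL Server 2022+
SELECT DATEADD(MONTH, gs.value, '2021-06-01') AS month_start
FROM GENERATE_SERIES(0, 23) AS gs;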

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

Cold on a Cob posted:

or generate_series() if you are running SQL Server 2022.

oh they added it? heck yeah :getin:

abelwingnut
Dec 23, 2002


oh boy, thank you both!

nielsm posted:

For that kind of log, your best bet is probably to iteratively replay it. Order the log entries by time and "execute" them one by one. It probably is possible to express with an SQL query in some way, but it's going to be difficult to write and read, and likely not be any faster either.

If you have the option to do this in a different language (like C# or Python or whatever) that might be less painful.

i am quite bad at all things not sql so i'd like to keep any solution within my wheelhouse. it would probably be easier for me to create some elaborate sql solution than a non-sql one, i think. thanks, though! that said, i'm not sure how you would 'execute' them like that? again, python and such are my enemy.


TheFluff posted:

That schema kinda sucks...

yes it does! salesforce's database is a goddamn mess. literally thousands of tables and completely unwieldy.

and fortunately i am writing this on redshift, which is new to me and based on postgres. lead? lateral? overlaps? i'll check them out. on top of that, this generate_series() is wild! i actually had a numbers table solution in my head that seemed kind of elaborate, but this should easily replace that.

i'll play around with this and see where it goes. as of now, though, i have a framework of a solution in my head. it goes:

a) create the structure of the dataset. each row for each month of each company. i actually already did this with a:

SQL code:
WITH RECURSIVE cteCalendar (d) AS (
	SELECT
		TO_DATE('20210601', 'YYYYMMDD') AS d
	UNION ALL
	SELECT
		CAST(DATEADD(month, 1, d) AS date) AS d
	FROM
		cteCalendar
	WHERE
		CAST(DATEADD(month, 2, d) AS date) < GETDATE()
)
SELECT
	blah
FROM
	accounts a
	JOIN cteCalendar cc
		ON 1 = 1
i suppose i could replace that with generate_series() above. any benefit there?

b) join accounthistory to this on company, YEAR(datechange), and MONTH(datechange). i'd then use newvalue to give me the correct value for the row for the month of change.

c) join accounthistory again on company, YEAR(DATEADD(month, -1, datechange)), and MONTH(DATEADD(month, -1, datechange)). i'd then use oldvalue to give me the correct value for the row for month preceding the change.

d) then do something like this.

SQL code:
	SELECT
		ISNULL(Val, (SELECT TOP 1 Val FROM dataset WHERE ID < t.ID AND Val IS NOT NULL ORDER BY ID DESC))
	FROM
		dataset
or something along those lines. effectively, i'd be filling out the NULL with previous rows. i haven't thought through all of this in detail, but that's where my head was at before reading your post.
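
for what it's worth, a window-function version of that forward-fill idea might look roughly like this (a sketch only: it assumes redshift's IGNORE NULLS support, and the table/column names are made up):

SQL code:
-- fill each month's NULL with the most recent non-NULL value for that account
-- (some dialects want the IGNORE NULLS outside the parens: LAST_VALUE(city) IGNORE NULLS OVER ...)
SELECT
	month_start,
	accountid,
	LAST_VALUE(city IGNORE NULLS) OVER (
		PARTITION BY accountid
		ORDER BY month_start
		ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
	) AS city_filled
FROM dataset;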

going to play around with it in a few hours and see where it goes.

thank you, both, again!

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

abelwingnut posted:

i suppose i could replace that with generate_series() above. any benefit there?
The recursive CTE is how you'd do it if you didn't have generate_series(); the main benefit of using generate_series() instead is readability (possibly it also performs better but I haven't looked at the plans).

I'm not 100% sure I understand your plan but the general idea seems to be basically replicating the effect of a window function. You're looking for values from the next or previous rows within a group (where the group in this case is rows with the same account_id and field_changed values). If you look at my query, the CTE at the top calculates the valid_to timestamp by using the lead() window function to take all the rows with the same account_id and field_changed values as the current row, sort them by changed_at, and then use the changed_at value from the row that comes after the currently evaluated one. If there is no next row, it uses infinity instead.

The whole cross join lateral nonsense is just me being lazy, there's probably a more efficient way to extract the most recent change out of each monthly group, but doing it this way just lets me put an arbitrary subquery that references fields from the outer query right in the from clause.

As for overlaps, it's a range operator. It's syntactic sugar and you can write an equivalent expression with a couple of comparisons, but it's very convenient because it covers a bunch of obnoxious edge cases (for example the range open and close values don't have to be ordered, that is if you have a range (start, end) it still works if start > end).
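
For example, these two ranges overlap even though the second pair is given end-first (a sketch; roughly, (s1, e1) OVERLAPS (s2, e2) behaves like s1 < e2 AND s2 < e1 once each pair is put in order):

SQL code:
SELECT ('2022-04-01'::date, '2022-05-01'::date) OVERLAPS ('2022-04-15'::date, '2022-03-01'::date);
-- returns true; OVERLAPS sorts each (start, end) pair itself before comparing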

My sketch probably looks more complex than it is; the CTE just adds a valid_to column to the accounthistory (by looking at the next row in each group), and then the main query joins one row per month (from generate_series()) with the most recent row that would have been valid during that month. You'd run into trouble once you get more than one account and field though because you'd need to modify the join so you get one row per field per account per month, but hopefully it gave you some ideas.

TheFluff fucked around with this message at 16:46 on Jun 19, 2023

abelwingnut
Dec 23, 2002


well, the query seems to be progressing, at least. i've managed to get the lag setup working and i'll figure out how to attach that to the main dataset in the next day or two.

but, things have gotten stupid on the aws side of things. apparently sagemaker's data wrangler cannot read from redshift serverless, which is where i'm doing all of this. so somehow i have to get all this nonsense out of athena instead. hopefully i can somehow create a view or, at worst, table based on the query i'm writing, have athena access that, then connect data wrangler that way.

god, i loving hate aws. nothing is easy with it. absolutely nothing.

sausage king of Chicago
Jun 13, 2001
So we have a MS SQL db that holds some credit card data and other personal information, like card holder name, address, etc. There is a weekly refresh process that takes the prod data and copies it down to our QA db. In order to stay PCI compliant, certain fields in the QA db should be anonymized, like card info, cardholder name, etc. We would like the anonymized data to be consistent, so like if there is a cardholder_id column in one table, that id is the same in all tables that reference it, same with cardholder_name, etc.

I've been googling and everyone says this is a pretty hard problem to solve. Does anyone have experience with any solutions that do this for you, preferably something that is free or not very expensive? Otherwise I was thinking that after the refresh happens, I'd write some scripts that create a temp table with the ids and anonymized data for the other fields, then update the real tables with the anonymized data matched by the ids. I feel like this solution seems pretty simple, so I'm not sure if I'm missing something.

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003
just leave the ids alone and run the other fields thru a secure hash function to anonymize?
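
A sketch of that idea in T-SQL (the table and column names are made up, and bear in mind a plain unsalted hash of low-entropy stuff like names can still be guessed, so run it past whoever owns your PCI compliance):

SQL code:
-- run after the prod -> QA refresh; ids are untouched so joins across tables still line up,
-- and identical inputs hash identically, which keeps the anonymized values consistent everywhere
UPDATE dbo.cardholders
SET cardholder_name = CONVERT(varchar(64), HASHBYTES('SHA2_256', cardholder_name), 2),
    address_line1   = CONVERT(varchar(64), HASHBYTES('SHA2_256', address_line1), 2);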

mortarr
Apr 28, 2005

frozen meat at high speed
If I have a table in an MS SQL database with a bunch of rows already, then I add a column like [ID] [int] IDENTITY(1,1) NOT NULL, what controls the order in which the rows are populated?

Is there some kind of underlying structure that would cause the rows to be populated sequentially in order of when they were inserted, or is it random?

mortarr fucked around with this message at 01:03 on Jul 10, 2023

Just-In-Timeberlake
Aug 18, 2003
I've always started with an identity column on a blank table and it's always gone sequentially. I've never added an identity column to an existing table with data already in it, but if I had to guess it'd be the same.

The quick answer is create a test table, dump some data into it, add an identity column and see what happens.
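
Something along these lines, say (throwaway table, made-up data):

SQL code:
-- does ALTER TABLE ... ADD ... IDENTITY number existing rows in insertion order? find out:
CREATE TABLE dbo.identity_test (created_date datetime2 NOT NULL, payload varchar(50) NOT NULL);
INSERT INTO dbo.identity_test (created_date, payload)
VALUES ('2023-01-01', 'first'), ('2023-02-01', 'second'), ('2023-03-01', 'third');
ALTER TABLE dbo.identity_test ADD id int IDENTITY(1,1) NOT NULL;
SELECT * FROM dbo.identity_test ORDER BY id;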

I would blow Dane Cook
Dec 26, 2008
Would it go on the clustered index?

Cold on a Cob
Feb 6, 2006

i've seen so much, i'm going blind
and i'm brain dead virtually

College Slice

mortarr posted:

If I have a table in an MS SQL database with a bunch of rows already, then I add a column like [ID] [int] IDENTITY(1,1) NOT NULL, what controls the order in which the rows are populated?

Is there some kind of underlying structure that would cause the rows to be populated sequentially in order of when they were inserted, or is it random?

The only thing I know for sure that might help is that SQL Server tables without a clustered index are heap tables, so there is no underlying secret clustered index that would preserve the insertion order. My guess is it would just create the ids based on the order of the data pages in the heap, which isn't necessarily insertion order.

Edit:
Some more to read here; it seems to imply the order is indeterminate: https://learn.microsoft.com/en-us/sql/relational-databases/indexes/heaps-tables-without-clustered-indexes?view=sql-server-ver16

quote:

Creating or dropping clustered indexes requires rewriting the entire table. If the table has nonclustered indexes, all the nonclustered indexes must all be recreated whenever the clustered index is changed. Therefore, changing from a heap to a clustered index structure or back can take a lot of time and require disk space for reordering data in tempdb.

In the past I've seen recommendations that if you do this and have another column you want to order on (eg a created_date column), you can add a clustered index on the created_date column, create ID column with identity and seed it based on the order, drop the clustered index and alter the ID column to be the new index. Never tried it though.
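
Roughly, the steps described would look like this (untested sketch, hypothetical names):

SQL code:
-- 1. cluster the heap on the column the ids should follow
CREATE CLUSTERED INDEX ix_tmp_order ON dbo.my_table (created_date);
-- 2. add the identity column; the idea is that values get assigned following the clustered order
ALTER TABLE dbo.my_table ADD id int IDENTITY(1,1) NOT NULL;
-- 3. swap the clustered index over to the new id column
DROP INDEX ix_tmp_order ON dbo.my_table;
ALTER TABLE dbo.my_table ADD CONSTRAINT pk_my_table PRIMARY KEY CLUSTERED (id);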

Cold on a Cob fucked around with this message at 16:05 on Jul 10, 2023

mortarr
Apr 28, 2005

frozen meat at high speed
Thanks re IDENTITY in MS SQL chat; it confirms my suspicions that I need to use some other columns if I want to populate it in a specific order.

KOTEX GOD OF BLOOD
Jul 7, 2012

I am trying to load a huge postgresql database (which I did not create) onto my local machine, an M1 ARM Mac, to play around with it. I am having one gently caress of a time even getting Postgres to function.

I've decided to use Postgres.app, which advertises itself as "the easiest way to get started with PostgreSQL on the Mac." Considering I can't get started, that is a bad sign.

I started by creating a user.

code:
createuser -U postgres -s $USER
Now it's time to create a database to put the data into.

code:
psql --command="create database database_name"
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL:  database "KOTEX" does not exist
Cool, off to a great start.

The troubleshooting page suggests I run createdb $USER.

code:
psql --command="createdb $USER"
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL:  database "KOTEX" does not exist
I have no idea where to go from here :/

lazerwolf
Dec 22, 2009

Orange and Black

KOTEX GOD OF BLOOD posted:

I am trying to load a huge postgresql database (which I did not create) onto my local machine, an M1 ARM Mac, to play around with it. I am having one gently caress of a time even getting Postgres to function.

I've decided to use Postgres.app, which advertises itself as "the easiest way to get started with PostgreSQL on the Mac." Considering I can't get started, that is a bad sign.

I started by creating a user.

code:
createuser -U postgres -s $USER
Now it's time to create a database to put the data into.

code:
psql --command="create database database_name"
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL:  database "KOTEX" does not exist
Cool, off to a great start.

The troubleshooting page suggests I run createdb $USER.

code:
psql --command="createdb $USER"
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL:  database "KOTEX" does not exist
I have no idea where to go from here :/

Seems like the “KOTEX” database you’re trying to connect to does not exist.

For Postgres.app you should be able to open a GUI and see what databases currently exist.

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003
createdb is a command-line program provided by the app. it is not a psql command
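
i.e. run it from the terminal, then point psql at a database that actually exists, something like:
code:
createdb "$USER"         # creates the default database psql keeps trying to connect to
createdb database_name   # the database you actually want to load the dump into
psql -d database_name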

credburn
Jun 22, 2016
A tangled skein of bad opinions, the hottest takes, and the world's most misinformed nonsense. Do not engage with me, it's useless, and better yet, put me on ignore.
Can someone tell me if I'm on the right path?

Here is the ER diagram with which I'm working:



My task is to come up with a business question and create a report using a lot of SQL. This is my second SQL course so I kind of have an idea of what to do; I was just wondering if someone could look over my idea and tell me if this is a Stupid Idea or if it's even really a feasible thing to do with what is available.

My idea for this Business Question is thus: Marketing wants to find out what actors are most popular in the districts where the most movies are rented.

In order to do that, I need to first find out what districts are most common. That should be easy. But then I need to find out what actors most frequently appear in the films being rented by those districts. That's where, conceptually, I get lost. I feel like this is the sort of mundane and typical task that will become second nature in a career setting, but right now I find I'm overwhelmed.
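
For reference, the rough shape of those two steps in SQL might be something like this (a sketch that assumes a Sakila/DVD-rental style schema, which is what the description sounds like; the table and column names are guesses, so adjust to the diagram):

SQL code:
-- step 1: districts where the most rentals happen
WITH district_rentals AS (
    SELECT a.district, COUNT(*) AS rental_count
    FROM rental r
    JOIN customer c ON c.customer_id = r.customer_id
    JOIN address a  ON a.address_id  = c.address_id
    GROUP BY a.district
    ORDER BY rental_count DESC
    LIMIT 10
)
-- step 2: actors appearing most often in films rented from those districts
SELECT a.district, ac.first_name, ac.last_name, COUNT(*) AS appearances
FROM rental r
JOIN customer c    ON c.customer_id  = r.customer_id
JOIN address a     ON a.address_id   = c.address_id
JOIN inventory i   ON i.inventory_id = r.inventory_id
JOIN film_actor fa ON fa.film_id     = i.film_id
JOIN actor ac      ON ac.actor_id    = fa.actor_id
WHERE a.district IN (SELECT district FROM district_rentals)
GROUP BY a.district, ac.actor_id, ac.first_name, ac.last_name
ORDER BY a.district, appearances DESC;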
