SELECT * FROM Questions WHERE Type = 'Stupid'

The Something Awful Forums > Discussion > Serious Hardware/Software Crap > The Cavern of COBOL > SELECT * FROM Questions WHERE Type = 'Stupid'

«‹›303 »

Ranzear: Jul 25, 2013

It's definitely threads and buffer size due to awful defaults. Index stuff would never be that bad. If you have access or can ask for the config I can better tell you what's up.

# ? Apr 20, 2023 01:53

Adbot: ADBOT LOVES YOU

# ? Jun 10, 2024 10:58

Ytlaya: Nov 13, 2005

Ranzear posted:

It's definitely threads and buffer size due to awful defaults. Index stuff would never be that bad. If you have access or can ask for the config I can better tell you what's up.

I think this is the contents of the relevant config file (if there's a way to show this much text in a scrollable window or something to avoid taking up so much space, someone let me know):

code:

[mysqld]

# innodb_empty_free_list_algorithm=backoff
innodb_buffer_pool_size=16G
# innodb_ibuf_max_size=2G
innodb_ft_min_token_size=3
# innodb_use_sys_malloc=0
innodb_file_per_table=ON
key_buffer_size=10M

ft_min_word_len = 3

# main = 1 for active master: server A
gtid-domain-id=1

tmpdir=/export/tmp
wait_timeout=180
lc_time_names=en_US
lc_messages=en_US
max_connections=2048
thread_cache_size=16
open_files_limit = 16384

query_cache_type = 1
query_cache_min_res_unit = 1k
query_cache_limit = 5120M
query_cache_size=2000M

log_error=/var/log/mysql/error.log

#general_log=1
#general_log_file=/var/log/mysql/server.log
skip-name-resolve

# Skip recovery for now:
innodb_force_recovery=1

bind-address = 0.0.0.0
port = 3306

slow_query_log=1
slow_query_log_file=/var/log/mysql/mysql-slow.log
long_query_time=60.0
log_queries_not_using_indexes=0
log_warnings=4
log_slow_admin_statements=ON

ft_min_word_len=3

log-bin=/var/lib/mysql/gn0-binary-log
expire-logs-days=120
server-id=1
# Domain = 1 for active master: server A
gtid-domain-id=1

[myisamchk]
sort_buffer_size=4M
ft_min_word_len = 3

# ? Apr 21, 2023 19:10

Ranzear: Jul 25, 2013

Query cache is way too big and configured backwards. query_cache_size should be 256m tops, only then with extremely large but static data. I don't think I ever set it part 128m on absolute monster nodes. query_cache_limit is completely reversed too: That's the max size of an individual query to be stored in the cache, so should just be reasonably larger than some result set you'd expect to reuse (not even your largest). 1m is fine unless y'all need a smidge more, but it can be mere kilobytes for real benefit.

Don't quote me on the actual mechanisms going on, but imagine that writing and later if not immediately invalidating 5 gig at a time every statement might sound a little off yeah? Surprised it hasn't complained about the sizes being irreconcilable.

Should A/B test with reasonable values, but I'd just turn it off entirely until you're sure there's no indexing to fix too.

That it hasn't given a startup error of outright refusal gives me a sneaking suspicion this isn't the only config file in play. Verify everything with SHOW VARIABLES, especially the innodb values which otherwise look fine.

Ranzear fucked around with this message at 11:27 on Apr 23, 2023

# ? Apr 23, 2023 11:24

kiwid: Sep 30, 2013

I feel like this might be a really stupid question but when I see queries like this:

SQL code:

SELECT
    s1_name_and_address.full_name,
    0.000000 AS ingr,
    s1_uom_conv_10000_a.conv_factor / 10000.0e0 AS order_formula_conv_factor,
    s1_uom_conv_10000_b.conv_factor / 10000.0e0 AS formula_mixer_conv_factor,
    s1_uom_a.measurement_type_code,
FROM ...

What does the 0e0 mean in "s1_uom_conv_10000_a.conv_factor / 10000.0e0" part?

# ? Apr 25, 2023 21:09

Bruegels Fuckbooks: Sep 14, 2004; Now, listen - I know the two of you are very different from each other in a lot of ways, but you have to understand that as far as Grandpa's concerned, you're both pieces of shit! Yeah. I can prove it mathematically.

kiwid posted:

What does the 0e0 mean in "s1_uom_conv_10000_a.conv_factor / 10000.0e0" part?

The number is in scientific notation (e meaning exponent). e.g. 10e0 is 10, 10e+1 is 100. Adding the e0 to the end of that literal makes it so the number is treated as a float instead of a decimal. Decimal and float behave differently in sql division.

# ? Apr 25, 2023 21:53

kiwid: Sep 30, 2013

Bruegels Fuckbooks posted:

The number is in scientific notation (e meaning exponent). e.g. 10e0 is 10, 10e+1 is 100. Adding the e0 to the end of that literal makes it so the number is treated as a float instead of a decimal. Decimal and float behave differently in sql division.

Great answer, thank you.

# ? Apr 25, 2023 22:27

Sauer: Sep 13, 2005; Socialize Everything!

I'm not a habitual database toucher, I just happen to be the company nerd ("you program CNC machines, I'm sure you can do SQL; whatever that is" - the management) and I'm working on an inventory report for the purchasing folks. The data is coming from a PostgreSQL database (Some ancient 9.3 version).

I'm extracting our on-hand inventory of various widgets which are spread out across various warehouses. Getting the quantity of each doohicky in the relevant warehouses is trivial but they'd prefer to see it in a columnar like form instead of row like and I've hit a snag trying to make that happen.

This is the query currently:

SQL code:

-- Find available part inventory by warehouse
select
    part_quantity.prt_id,
    part_quantity.whs_no,
    part_quantity.pqt_real_qty

from
    part_quantity

where
        part_quantity.whs_no in ('AAA', 'BBB', 'CCC', 'DDD')	-- Relevant warehouses
    and part_quantity.prt_id = 7313				-- Limit to one part for testing

Which gets us:

pre:

prt_id		whs_no		pqt_real_qty
7313		AAA		0.0
7313		BBB		0.0
7313		CCC		0.0
7313		DDD		401.0

They'd prefer to see it as:

pre:

prt_id		AAA	BBB	CCC	DDD
7313		0.0	0.0	0.0	401.0

I've already discovered the crosstab() function in the tablefunc extension which I imagine is designed to do exactly this but I honestly can't make heads or tails of its documentation and I wouldn't be surprised if there's a simpler way I'm just not aware of. Any hints other than telling my boss to send me to SQL school if he wants this sort of stuff? I've already suggested training but...

# ? May 1, 2023 17:52

nielsm: Jun 1, 2009

Untested, but I think it should look something like this:

SQL code:

SELECT *
FROM crosstab(
    'SELECT prt_id, whs_no, pqt_real_qty FROM part_quantity',
    'SELECT DISTINCT whs_no FROM part_quantity'
)

# ? May 1, 2023 19:30

Sauer: Sep 13, 2005; Socialize Everything!

I took a break on this project, machined some parts, and came back to reread the documentation for crosstab() and worked it out. First parameter of the function retrieves the data set and second parameter determines the categorization. The resulting query is kind of ugly, but does work:

SQL code:

select
    *

from crosstab(
$$
    select
        part_quantity.prt_id,
        part_quantity.whs_no,
        sum(part_quantity.pqt_real_qty)

    from
        part_quantity

    where
        part_quantity.whs_no in ('AAA', 'BBB', 'CCC', 'DDD')
    
    group by
        prt_id,
        whs_no
    
    order by
        prt_id,
        whs_no
$$,

$$ 
    select distinct 
        whs_no 
    
    from 
        part_quantity 
    
    where
        part_quantity.whs_no in ('AAA', 'BBB', 'CCC', 'DDD')
    
    order by
        whs_no
$$
) as ct (prt_id bigint, aaa numeric, bbb numeric, ccc numeric, ddd numeric)

Thanks for the hint, nielsm.

# ? May 1, 2023 20:30

raminasi: Jan 25, 2005; a last drink with no ice

CKinbote posted:

I'm struggling at work with a (third-party) application built around a database; I only ever interact with databases through an ORM, so I'd love a more informed opinion as to whether this thing is as dumb as it appears.

This application ingests and stores data from experiments run on various samples. Each sample has a unique ID. Each experiment has its own table in the database, with fields for the sample ID and various things measured in the experiment. The end user wants all the data in one big table, so a batch job joins all the experiment tables into one big display table.

Currently the batch job takes an hour to run. There are about 20 experiment tables, with a total of about 500 total measurement fields. There have been about 60,000 samples. There are a total of about 2 million measured values, so about 14 out of every 15 values in the final table are empty.

I have a hard time believing this is state of the art, but I don't know much about databases and am curious whether there is a deep reason for this design that I don't know about. I feel like storing the data in a "skinny" format with fields for just sample ID, field ID, and value (and then have other tables with details for each field, like name and originating experiment) could be the way to go, but I wonder if that would fail to scale for some reason I don't know about.

I don't know much about how this database is implemented (other than "in Java") or how the AWS instance that runs all this is provisioned, unfortunately.

I agree with previous responses that if the end user wants it this way, figure out why they do before you advocate for any changes, but I'll add that the "sparse table" style you're describing isn't a novel thing this project invented. It's one of a number of known solutions to this kind of problem. This isn't to say it's necessarily a good solution in this or any particular case, but it's not totally out of left field.

# ? May 1, 2023 20:42

Lurkmaster 5000: Mar 11, 2019

I am writing some code that will return a date based on if they are in a certain group or not. The first CASE statement does the job, but the second one only works for the first value in each IN statement. What's going on?

CASE
WHEN pts2.agegroup='2_19' or pts2.agegroup='5-19' or pts2.agegroup='12_19' then '03-01-2019'
WHEN pts2.agegroup='2_20' or pts2.agegroup='5-20' or pts2.agegroup='12_20' then '03-01-2020'
WHEN pts2.agegroup='2_21' or pts2.agegroup='5-21' or pts2.agegroup='12_21' then '03-01-2021'
WHEN pts2.agegroup='2_22' or pts2.agegroup='5-22' or pts2.agegroup='12_22' then '03-01-2022'
WHEN pts2.agegroup='2_23' or pts2.agegroup='5-23' or pts2.agegroup='12_23' then '03-01-2023'
END

CASE
WHEN pts2.agegroup IN ('2_19' ,'5-19' ,'12_19') then '03-01-2019'
WHEN pts2.agegroup IN ('2_20' ,'5-20' ,'12_20') then '03-01-2020'
WHEN pts2.agegroup IN ('2_21' ,'5-21' ,'12_21') then '03-01-2021'
WHEN pts2.agegroup IN ('2_22' ,'5-22' ,'12_22') then '03-01-2022'
WHEN pts2.agegroup IN ('2_23' ,'5-23' ,'12_23') then '03-01-2023'
END

# ? May 25, 2023 14:15

Moonwolf: Jun 29, 2004; Flee from th' terrifyin' evil of "NHS"!

Lurkmaster 5000 posted:

I am writing some code that will return a date based on if they are in a certain group or not. The first CASE statement does the job, but the second one only works for the first value in each IN statement. What's going on?

CASE
WHEN pts2.agegroup='2_19' or pts2.agegroup='5-19' or pts2.agegroup='12_19' then '03-01-2019'
WHEN pts2.agegroup='2_20' or pts2.agegroup='5-20' or pts2.agegroup='12_20' then '03-01-2020'
WHEN pts2.agegroup='2_21' or pts2.agegroup='5-21' or pts2.agegroup='12_21' then '03-01-2021'
WHEN pts2.agegroup='2_22' or pts2.agegroup='5-22' or pts2.agegroup='12_22' then '03-01-2022'
WHEN pts2.agegroup='2_23' or pts2.agegroup='5-23' or pts2.agegroup='12_23' then '03-01-2023'
END

CASE
WHEN pts2.agegroup IN ('2_19' ,'5-19' ,'12_19') then '03-01-2019'
WHEN pts2.agegroup IN ('2_20' ,'5-20' ,'12_20') then '03-01-2020'
WHEN pts2.agegroup IN ('2_21' ,'5-21' ,'12_21') then '03-01-2021'
WHEN pts2.agegroup IN ('2_22' ,'5-22' ,'12_22') then '03-01-2022'
WHEN pts2.agegroup IN ('2_23' ,'5-23' ,'12_23') then '03-01-2023'
END

Those should be the same, which SQL engine is this in? Although I've not heard of any having fucky CASE statements.

# ? May 25, 2023 20:16

abelwingnut: Dec 23, 2002

at least in sql server, i was always told the engine turns an IN statement into the same logic/pathway it uses for multiple OR. so the engine's turning the second one into the first one.

# ? May 25, 2023 21:04

Twerk from Home: Jan 17, 2009; This avatar brought to you by the 'save our dead gay forums' foundation.

Not really a SQL question, but I figure y'all have the lay of the land for analytics tools in here. Cross posting:

Twerk from Home posted:

I'm wanting a single-node simple text search for exploring datasets on the order of couple terabytes of text that's medium-compressible. I want to KISS because this fits on a single computer, but we still want chunking so that it's reasonably fast and efficient and uses several cores. Dataset sizes are around 1-5TB uncompressed and 200-300GB compressed.

We've got lots of these datasets and we don't need HA uptime and have no streaming ingest. What tools should I look at before gluing something together myself? Single node elasticsearch? DuckDB? Shell scripts calling split, zstd, ripgrep, cat and a dozen other unix tools?

# ? May 25, 2023 23:11

I would blow Dane Cook: Dec 26, 2008

https://www.youtube.com/watch?v=k4RRi_ntQc8

# ? May 29, 2023 03:34

nem: Jan 4, 2003; panel.dev
apnscp: cPanel evolved

On the Postgres time-series side with TimescaleDB, has there been much luck on reducing record size by cleaving the TIMESTAMP field to an INT while retaining native functions?

# ? May 29, 2023 05:01

Just-In-Timeberlake: Aug 18, 2003

what say you thread

# ? Jun 7, 2023 01:42

lazerwolf: Dec 22, 2009; Orange and Black

Just-In-Timeberlake posted:

what say you thread

Can SQL center a div?

# ? Jun 7, 2023 01:58

LightRailTycoon: Mar 24, 2017

lazerwolf posted:

Can SQL center a div?

The dialect I'm stuck with has a
REMOVE JAVA Statement

# ? Jun 7, 2023 02:16

I would blow Dane Cook: Dec 26, 2008

SELECT * FROM my
where passion = 'sql'

# ? Jun 16, 2023 14:31

Bruegels Fuckbooks: Sep 14, 2004; Now, listen - I know the two of you are very different from each other in a lot of ways, but you have to understand that as far as Grandpa's concerned, you're both pieces of shit! Yeah. I can prove it mathematically.

I would blow Dane Cook posted:

SELECT * FROM my
where passion = 'sql'

ORA-01403 Error Message: �No Data Found�

# ? Jun 16, 2023 16:09

abelwingnut: Dec 23, 2002

so i've got a noodlescratcher here.

i am trying to build a dataset from our salesforce db detailing how our companies and their attributes and measures have changed monthly over the last two years. therefore, each company would have 24 rows, each representing a month. it's obviously easy to get the attributes and measures for our current month--i just pull the data from the accounts and users table in their current state. however, trying to get the previous 23 months for each company is not obvious to me right now.

what i have to work with are an accountshistory table and a contractshistory table. they're effectively log files. in each one, the structure is basically:

--

id, accountid, dateofchange, fieldchanged, oldvalue, newvalue

--

there's some other stuff in there too, but nothing that seems like it'd help get what i want. but yea, each row in each of these tables represents a single change to whatever account, then tells you exactly what field changed and when.

so, i have the data, i'm just not sure how to organize it, i suppose? like, how do i take these log files and produce something that'll match up the monthly structure of the dataset i want to create?

like, for instance, let's say account a's city was changed in april of 2022. it was new york before, but was changed to dc. i can see how i can match the data up so that, in my target dataset, i have the city field for account a's april 2022 row with the city = 'dc'. but how would i then fill that for may 2022? june 2022? and so forth until whenever the next change occurred or we arrive at the current day without a change?

i guess that's my first big question. i think i have others, but maybe whatever answer we come up with will help with those. maybe there's some clever over (partition by � ) logic to make this work in the target dataset? or do i need to create a temp table or cte or something before i even get to the target which effectively rearranges the data to:

...
february | 2022 | city | new york
march | 2022 | city | new york
april | 2022 | city | dc
may | 2022 | city | dc
...

and just have that happen for all fields? how would i even start with something like that?

i don't know--i'm kind of at a loss here. and some basic googling hasn't really given me much direction. probably not searching for the right terms, but it's also hard to phrase for a search.

any help would be greatly appreciated!

abelwingnut fucked around with this message at 06:12 on Jun 19, 2023

# ? Jun 19, 2023 05:53

nielsm: Jun 1, 2009

For that kind of log, your best bet is probably to iteratively replay it. Order the log entries by time and "execute" them one by one. It probably is possible to express with an SQL query in some way, but it's going to be difficult to write and read, and likely not be any faster either.

If you have the option to do this in a different language (like C# or Python or whatever) that might be less painful.

# ? Jun 19, 2023 06:16

TheFluff: Dec 13, 2006; FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

abelwingnut posted:

so i've got a noodlescratcher here.

That schema kinda sucks but you can transform it to what you need with some effort. If a report asks for one row per month regardless of whether there is any data for that month or not, the correct answer is usually a cross join on generate_series(start_date, end_date, '1 months'::interval) and if you're not on postgres you get to lament your lack of generate_series() and write some ridiculous recursive CTE instead.

Here's kind of a sketch of something kinda like what I think you want, almost certainly not the easiest or most efficient way to do it but it was fun to write:

SQL code:

with accounthistory_ranges as (
  select
    account_id,
    field_changed,
    changed_at as valid_from,
    lead(changed_at, 1, 'infinity') over (
      partition by account_id, field_changed
      order by accounthistory.changed_at
    ) as valid_to,
    new_value as val
  from accounthistory
)
select
  months.first_day,
  most_recent_value_per_month.*
from
  generate_series(
    '2022-10-01T00:00:00.000Z'::timestamptz,
    date_trunc('month', now()),
    '1 months'::interval
  ) as months (first_day)
  cross join lateral (
    select *
    from accounthistory_ranges
    where
      (accounthistory_ranges.valid_from, accounthistory_ranges.valid_to) overlaps
        (months.first_day, months.first_day + '1 months'::interval)
    order by accounthistory_ranges.valid_to desc
    limit 1
  ) as most_recent_value_per_month;

This is very much not complete, it works for the toy example but I don't think it's a complete solution. As mentioned it's a sketch that you may be able to build from; here's a fiddle if you wanna mess around.

(I assumed there can be more than one change per field per month but that you only want the most recent one.)

TheFluff fucked around with this message at 15:52 on Jun 19, 2023

# ? Jun 19, 2023 15:36

Cold on a Cob: Feb 6, 2006; i've seen so much, i'm going blind
and i'm brain dead virtually; College Slice

If you're writing t-sql you can generate a series using string_split() if you are running SQL Server 2016 or later, or generate_series() if you are running SQL Server 2022.

# ? Jun 19, 2023 15:44

TheFluff: Dec 13, 2006; FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

Cold on a Cob posted:

or generate_series() if you are running SQL Server 2022.

oh they added it? heck yeah :getin:

# ? Jun 19, 2023 15:45

abelwingnut: Dec 23, 2002

oh boy, thank you both!

nielsm posted:

For that kind of log, your best bet is probably to iteratively replay it. Order the log entries by time and "execute" them one by one. It probably is possible to express with an SQL query in some way, but it's going to be difficult to write and read, and likely not be any faster either.

If you have the option to do this in a different language (like C# or Python or whatever) that might be less painful.

i am quite bad at all thing not sql so i'd like to keep any solution within my wheelhouse. it would probably be easier for me to create some elaborate sql solution than a non-sql one, i think. thanks, though! that said, i'm not sure how you would 'execute' them like that? again, python and such are my enemy.

TheFluff posted:

That schema kinda sucks...

yes it does! salesforce's database is a goddamn mess. literally thousands of tables and completely unwieldy.

and fortunately i am writing this on redshift, which is new to me and based on postgres. lead? lateral? overlaps? i'll check them out. on top of that, this generate_series() is wild! i actually had a numbers table solution in my head that seemed kind of elaborate, but this should easily replace that.

i'll play around with this and see where it goes. as of now, though, i have a framework of a solution in my head. it goes:

a) create the structure of the dataset. each row for each month of each company. i actually already did this with a:

SQL code:

WITH RECURSIVE cteCalendar (d) AS (
	SELECT
		TO_DATE('20210601', 'YYYYMMDD') AS d
	UNION ALL
	SELECT
		CAST(DATEADD(month, 1, d) AS date) AS d
	FROM
		cteCalendar
	WHERE
		CAST(DATEADD(month, 2, d) AS date) < GETDATE()
)
SELECT
	blah
FROM
	accounts a
	JOIN cteCalendar cc
		ON 1 = 1

i suppose i could replace that with generate_series() above. any benefit there?

b) join accounthistory to this on company, YEAR(datechange), and MONTH(datechange). i'd then use newvalue to give me the correct value for the row for the month of change.

c) join accounthistory again on company, YEAR(DATEADD(month, -1, datechange)), and MONTH(DATEADD(month, -1, datechange)). i'd then use oldvalue to give me the correct value for the row for month preceding the change.

d) then do something like this.

SQL code:

	SELECT
		ISNULL(Val, (SELECT TOP 1 Val FROM dataset WHERE ID < t.ID AND Val IS NOT NULL ORDER BY ID DESC))
	FROM
		dataset

or something along those lines. effectively, i'd be filling out the NULL with previous rows. i haven't thought through all of this in detail, but that's where my head was at before reading your post.

going to play around with it in a few hours and see where it goes.

thank you, both, again!

# ? Jun 19, 2023 16:02

TheFluff: Dec 13, 2006; FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

abelwingnut posted:

i suppose i could replace that with generate_series() above. any benefit there?

The recursive CTE is how you'd do it if you didn't have generate_series(); the main benefit of using generate_series() instead is readability (possibly it also performs better but I haven't looked at the plans).

I'm not 100% sure I understand your plan but the general idea seems to be basically replicating the effect of a window function. You're looking for values from the next or previous rows within a group (where the group in this case is rows with the same account_id and field_changed values). If you look at my query, the CTE at the top calculates the valid_to timestamp by using the lead() window function to take all the rows with the same account_id, field_changed values as the current row, sort them by changed_at, and then use the changed_at value from the row that comes after the currently evaluated one. If there is no next row, it uses infinity instead.

The whole cross join lateral nonsense is just me being lazy, there's probably a more efficient way to extract the most recent change out of each monthly group, but doing it this way just lets me put an arbitrary subquery that references fields from the outer query right in the from clause.

As for overlaps, it's a range operator. It's syntactic sugar and you can write an equivalent expression with a couple of comparisons, but it's very convenient because it covers a bunch of obnoxious edge cases (for example the range open and close values don't have to be ordered, that is if you have a range (start, end) it still works if start > end).

My sketch probably looks more complex than it is; the CTE just adds a valid_to column to the accounthistory (by looking at the next row in each group), and then the main query joins one row per month (from generate_series()) with the most recent row that would have been valid during that month. You'd run into trouble once you get more than one account and field though because you'd need to modify the join so you get one row per field per account per month, but hopefully it gave you some ideas.

TheFluff fucked around with this message at 16:46 on Jun 19, 2023

# ? Jun 19, 2023 16:39

abelwingnut: Dec 23, 2002

well, the query seems to be progressing, at least. i've managed to get the lag setup working and i'll figure out how to attach that to the main dataset in the next day or two.

but, things have gotten stupid on the aws side of things. apparently sagemaker's data wrangler cannot read from redshift serverless, which is where i'm doing all of this. so somehow i have to get all this nonsense out of athena instead. hopefully i can somehow create a view or, at worst, table based on the query i'm writing, have athena access that, then connect data wrangler that way.

god, i loving hate aws. nothing is easy with it. absolutely nothing.

# ? Jun 20, 2023 05:11

sausage king of Chicago: Jun 13, 2001

So we have a MS SQL db that holds some credit card data and other personal information, like card holder name, address, etc. There is a weekly refresh process that takes the prod data and copies it down to our QA db. In order to stay PCI compliant, certain fields in the QA db should be anonymized, like card info, cardholder name, etc. We would like the anonymized data to be consistent, so like if there is a cardholder_id column in one table, that id is the same in all tables that reference it, same with cardholder_name, etc.

I've been googling and everyone says this is a pretty hard problem to solve. Does anyone have experience with any solutions that do this for you, preferably something that is free or not very expensive? Otherwise I was thinking after the refresh happens, to write some scripts that create a temp table with the ids and then anonymized data for the other fields, then updating the real tables with the anonymized data matched by the ids? I feel like this solution seems pretty simple, so I'm not sure if I'm missing something.

# ? Jun 21, 2023 04:59

DELETE CASCADE: Oct 25, 2017; i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003

just leave the ids alone and run the other fields thru a secure hash function to anonymize?

# ? Jun 22, 2023 00:30

mortarr: Apr 28, 2005; frozen meat at high speed

If I have a table in an MS SQL database with a bunch of rows already, then I add a column like [ID] [int] IDENTITY(1,1) NOT NULL, what controls the order in which the rows are populated?

Is there some kind of underlying structure that would cause the rows to be populated sequentially in order of when they were inserted, or is it random?

mortarr fucked around with this message at 01:03 on Jul 10, 2023

# ? Jul 10, 2023 01:01

Just-In-Timeberlake: Aug 18, 2003

I've always started with an identity column on a blank table and it's always gone sequentially, but I've never added an identity column to an existing table with data already in it but if I had to guess it'd be the same.

The quick answer is create a test table, dump some data into it, add an identity column and see what happens.

# ? Jul 10, 2023 14:25

I would blow Dane Cook: Dec 26, 2008

Would it go on the clustered index?

# ? Jul 10, 2023 15:22

Cold on a Cob: Feb 6, 2006; i've seen so much, i'm going blind
and i'm brain dead virtually; College Slice

mortarr posted:

If I have a table in an MS SQL database with a bunch of rows already, then I add a column like [ID] [int] IDENTITY(1,1) NOT NULL, what controls the order in which the rows are populated?

Is there some kind of underlying structure that would cause the rows to be populated sequentially in order of when they were inserted, or is it random?

The only thing I know for sure that might help is SQL Server tables without a clustered index are a heap table so there is no underlying secret clustered index that would preserve the insertion order. My guess is it would just create the ids based on the order of the data pages in the heap, which is definitely not in inserted order.

Edit:
Some more to read here, it seems to imply the order is indeterminate: https://learn.microsoft.com/en-us/sql/relational-databases/indexes/heaps-tables-without-clustered-indexes?view=sql-server-ver16

quote:

Creating or dropping clustered indexes requires rewriting the entire table. If the table has nonclustered indexes, all the nonclustered indexes must all be recreated whenever the clustered index is changed. Therefore, changing from a heap to a clustered index structure or back can take a lot of time and require disk space for reordering data in tempdb.

In the past I've seen recommendations that if you do this and have another column you want to order on (eg a created_date column), you can add a clustered index on the created_date column, create ID column with identity and seed it based on the order, drop the clustered index and alter the ID column to be the new index. Never tried it though.

Cold on a Cob fucked around with this message at 16:05 on Jul 10, 2023

# ? Jul 10, 2023 16:01

mortarr: Apr 28, 2005; frozen meat at high speed

Thanks re IDENTITY in MS SQL chat, it confirms my suspicions that I need to use some other columns if I want to populate it in a specific order.

# ? Jul 10, 2023 23:42

KOTEX GOD OF BLOOD: Jul 7, 2012

I am trying to load a huge postgresql database (which I did not create) onto my local machine, an M1 ARM Mac, to play around with it. I am having one gently caress of a time even getting Postgres to function.

I've decided to use Postgres.app, which advertises itself as "the easiest way to get started with PostgreSQL on the Mac." Considering I can't get started, that is a bad sign.

I started by creating a user.

code:

createuser -U postgres -s $USER

Now it's time to create a database to put the data into.

code:

psql --command="create database database_name"
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL:  database "KOTEX" does not exist

Cool, off to a great start.

The troubleshooting page suggests I run createdb $USER.

code:

psql --command="createdb $USER"
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL:  database "KOTEX" does not exist

I have no idea where to go from here :/

# ? Jul 21, 2023 19:15

lazerwolf: Dec 22, 2009; Orange and Black

KOTEX GOD OF BLOOD posted:

I am trying to load a huge postgresql database (which I did not create) onto my local machine, an M1 ARM Mac, to play around with it. I am having one gently caress of a time even getting Postgres to function.

I've decided to use Postgres.app, which advertises itself as "the easiest way to get started with PostgreSQL on the Mac." Considering I can't get started, that is a bad sign.

I started by creating a user.
code:
createuser -U postgres -s $USER
Now it's time to create a database to put the data into.
code:
psql --command="create database database_name"
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL:  database "KOTEX" does not exist
Cool, off to a great start.

The troubleshooting page suggests I run createdb $USER.
code:
psql --command="createdb $USER"
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL:  database "KOTEX" does not exist
I have no idea where to go from here :/

Seems like the �KOTEX� database you�re trying to create users does not exist.

For Postgres.app you should be able to open a GUI and see what databases currently exist.

# ? Jul 21, 2023 19:39

DELETE CASCADE: Oct 25, 2017; i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003

createdb is a command-line program provided by the app. it is not a psql command

# ? Jul 22, 2023 05:58

Adbot: ADBOT LOVES YOU

# ? Jun 10, 2024 10:58

credburn: Jun 22, 2016; President, Founder of the Brent Spiner Fan Club

Can someone tell me if I'm on the right path?

Here is the ER diagram with which I'm working:

My task is to come up with a business question and create a report using a lot of SQL. This is my second SQL course so I kind of have an idea of what to do; I was just wondering if someone could look over my idea and tell me if this is a Stupid Idea or if it's even really a feasible thing to do with what is available.

My idea for this Business Question is thus: Marketing wants to find out what actors are most popular in the districts where the most movies are rented.

In order to do that, I need to first find out what districts are most common. That should be easy. But then I need to find out what actors most frequently appear in the films being rented by those districts. That's where, conceptually, I get lost. I feel like this is the sort of mundane and typical task that will become second nature in a career setting, but right now I find I'm overwhelmed.

# ? Jul 25, 2023 19:56

The Something Awful Forums > Discussion > Serious Hardware/Software Crap > The Cavern of COBOL > SELECT * FROM Questions WHERE Type = 'Stupid'

«‹›303 »