mortarr
Apr 28, 2005

frozen meat at high speed
SQL Server text search question:

I've got two tables: one with text data (concatenated from a bunch of fields from other sources) that I want to rank, and the other with keywords and a flag to indicate whether the presence of the keyword (or its inflectional forms?) is good or bad:

code:
Elements table example:

Id | Content
-------------
1  | Spurrier, Jo, author. Spurrier, Jo, author. Publication Date(s) 2014 Format Books Sci-Fi and Fantasy-Fiction Zone North star Fantasy fiction.
2  | Britain, Kristen, author. Britain, Kristen, author. Publication Date(s) 2014 Format Books Summary Sci-Fi and Fantasy-Fiction Zone Mirror sight Messengers -- 
     Fiction. Magicians -- Fiction. Good and evil -- Fiction. Magic -- Fiction. Teenage girls -- Fiction.
3  | Anderson, Taylor, 1963- Anderson, Taylor, 1963- Publication Date(s) 2014 Format Books Sci-Fi and Fantasy-Fiction Zone Deadly shores Imaginary wars and battles -- 
     Fiction. World War, 1939-1945 -- Fiction. Destroyers (Warships) -- Fiction.
4  | Francés, Victoria. Francés, Victoria. Publication Date(s) 2008 Format 3-D Object Jigsaws-Fiction Zone Butterfly [game (jigsaw)] Jigsaw puzzles.
5  | Hewitt, Kate author. Hewitt, Kate author. Publication Date(s) 2014 Format Books Romantic Writing-Romances-Fiction Zone A queen for the taking? 
6  | Smart, Michelle, author. Smart, Michelle, author. Publication Date(s) 2014 Format Books Romantic Writing-Romances-Fiction Zone What a sicilian husband wants


KeywordData table example:

RankModifier | Keyword
-----------------------
-1           | Large Print
-1           | Romantic Writing
1            | Jigsaws
1            | Magic

What I want to do is run a FREETEXTTABLE query across Elements once for each row in KeywordData, then combine the FREETEXTTABLE results with the raw data, ordered by the sum of the modifiers, so that results with the most positive keywords bubble to the top and the most negative sink to the bottom.

I can see how to do this with a loop inside a proc using queries like those below, but is there any way to do it in a single view, using the KeywordData table to feed both the FREETEXTTABLE search text parameter and the factor in the RankModifier column? Does what I'm trying to do even seem worthwhile, or is there a better approach? I'm pretty new to this whole SQL text search thing.

code:
  SELECT
    ElementId,
    Data,
    FullTextData,
    CAST(r.RANK as DECIMAL)/(SELECT MAX(RANK) FROM FREETEXTTABLE(dbo.[Elements], [FullTextData], 'romantic', 1)) * -1 AS Modifier

  FROM
    FREETEXTTABLE(dbo.[Elements], [FullTextData], 'romantic') r
    INNER JOIN dbo.[Elements] e
    ON r.[KEY] = e.[ElementId]  
  ORDER BY
    [Rank] DESC


  SELECT
    ElementId,
    Data,
    FullTextData,
    CAST(r.RANK as DECIMAL)/(SELECT MAX(RANK) FROM FREETEXTTABLE(dbo.[Elements], [FullTextData], 'Jigsaws', 1)) * 1 AS Modifier

  FROM
    FREETEXTTABLE(dbo.[Elements], [FullTextData], 'Jigsaws') r
    INNER JOIN dbo.[Elements] e
    ON r.[KEY] = e.[ElementId]  
  ORDER BY
    [Rank] DESC
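For what it's worth, the loop version I have in mind is shaped like this - as far as I know, FREETEXTTABLE won't take its search string from another table's column, so the keyword has to come in via a variable. Untested sketch; #scores is an invented temp table:

code:
```sql
DECLARE @keyword NVARCHAR(200), @modifier INT;
CREATE TABLE #scores (ElementId INT, Score DECIMAL(9, 4));

DECLARE kw CURSOR LOCAL FAST_FORWARD FOR
  SELECT Keyword, RankModifier FROM dbo.KeywordData;
OPEN kw;
FETCH NEXT FROM kw INTO @keyword, @modifier;
WHILE @@FETCH_STATUS = 0
BEGIN
  -- normalise RANK to 0..1 within this keyword's results, signed by the modifier
  INSERT INTO #scores (ElementId, Score)
  SELECT r.[KEY],
         @modifier * CAST(r.RANK AS DECIMAL) / NULLIF(MAX(r.RANK) OVER (), 0)
  FROM FREETEXTTABLE(dbo.[Elements], [FullTextData], @keyword) r;

  FETCH NEXT FROM kw INTO @keyword, @modifier;
END;
CLOSE kw;
DEALLOCATE kw;

-- elements with the most positive keyword hits bubble to the top,
-- the most negative sink to the bottom
SELECT e.ElementId, e.FullTextData, SUM(s.Score) AS TotalModifier
FROM #scores s
INNER JOIN dbo.[Elements] e ON e.ElementId = s.ElementId
GROUP BY e.ElementId, e.FullTextData
ORDER BY TotalModifier DESC;
```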

Adbot
ADBOT LOVES YOU

mortarr
Apr 28, 2005

frozen meat at high speed

uguu posted:

Those are good points.
I considered InfluxDB - then I could just throw everything into a bucket and wash my hands of it. But as little as I know of SQL, I know nothing of Flux, and the manual wasn't very helpful.
I will think about this a bunch more.
Thanks for the advice!

Not sure if you made any progress on this, but I did a similar thing at my old work - we had about 30 years' worth of three-waters-related SCADA measurements for various sites around the city, at 5-minute intervals. I don't work there anymore and it was a couple of years ago, but basically we had a bunch of ETL processes that collated the data from the various sources and ultimately inserted it into Postgres/TimescaleDB. We then used PostgREST to surface a proc built on TimescaleDB's hyperfunctions that pulled the detail data, plus a couple of other procs surfacing summary data to help find the devices of interest.

The data looked like: device id/location/time/value/comment, so not wide, just shitloads of it.

We surfaced PostgREST through an API gateway that let us limit access via API keys, rate limit, have nice-looking URL paths and all that good stuff. It took a bit of thought to put it all together, as we were new to both Postgres and TimescaleDB, but once it was set up it was pretty good. I really liked the idea of PostgREST and not needing to roll a custom API/DB interface.
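For flavour, the detail proc was roughly this shape - all names here are invented, the real thing is long gone - with PostgREST exposing the function as an RPC endpoint (e.g. POST /rpc/device_readings):

code:
```sql
-- hypothetical Postgres/TimescaleDB function surfaced by PostgREST
CREATE FUNCTION api.device_readings(dev_id text, from_ts timestamptz, to_ts timestamptz)
RETURNS TABLE (bucket timestamptz, avg_value double precision)
LANGUAGE sql STABLE AS $$
  SELECT time_bucket('5 minutes', m.time) AS bucket,  -- timescaledb hyperfunction
         avg(m.value) AS avg_value
  FROM measurements m
  WHERE m.device_id = dev_id
    AND m.time >= from_ts AND m.time < to_ts
  GROUP BY 1
  ORDER BY 1;
$$;
```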

mortarr
Apr 28, 2005

frozen meat at high speed

KOTEX GOD OF BLOOD posted:

I want to have a list of "these are the Scotches where more than 65% of their sales (by dollar amount) contain 'Bar' or 'Lounge.'"

I like header rows just in case there are null combinations in the data; it's not strictly necessary here, as you didn't ask for a breakdown by Purpose for those that made the 65%. Anyway, here we are - probably overkill for your question! I think you could also do this with windowed functions, in one statement with no subqueries.

code:
;with src AS (
  SELECT 'Laphroaig' AS Scotch, 5000.0 AS SalesAmt, 'Bar' AS Purpose
  UNION SELECT 'Lagavulin', 6000, 'Lounge'
  UNION SELECT 'Glenlivet', 2000, 'Home'
  UNION SELECT 'Glenmorangie', 1500, 'Lounge'
  UNION SELECT 'Lagavulin', 1000, 'Lounge'
  UNION SELECT 'Lagavulin', 9000, 'Bar'
  UNION SELECT 'Glenlivet', 8000, 'Bar'
  UNION SELECT 'Glenlivet', 7500, 'Lounge'
  UNION SELECT 'Glenmorangie', 6000, 'Home'
  UNION SELECT 'Glenmorangie', 2000, 'Bar'
  UNION SELECT 'Laphroaig', 3500, 'Lounge'

), detail AS (  
  SELECT
    Scotch,
    Purpose,
    SUM(SalesAmt) AS SumOfSalesAmt
  FROM
    src
  GROUP BY
    Scotch,
    Purpose

), salesByScotch AS (
  SELECT
    Scotch,
    SUM(SalesAmt) AS SumOfSalesAmt,
    SUM(CASE WHEN Purpose IN ('Bar','Lounge') THEN SalesAmt ELSE 0 END) AS SumOfSalesAmtBarOrLounge
  FROM
    src
  GROUP BY
    Scotch
  
), headers_Scotch AS (
  SELECT DISTINCT 
    Scotch
  FROM 
    src
)

SELECT
  hs.Scotch,
  sbs.SumOfSalesAmt,
  COALESCE(sbs.SumOfSalesAmt, 0) * 0.65 AS [65% SumOfSalesAmt],
  sbs.SumOfSalesAmtBarOrLounge  
FROM
  headers_Scotch hs
  INNER JOIN salesByScotch sbs
  ON hs.Scotch = sbs.Scotch
WHERE
  sbs.SumOfSalesAmtBarOrLounge > COALESCE(sbs.SumOfSalesAmt, 0) * 0.65
ORDER BY
  hs.Scotch
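The windowed-function version I mentioned comes out something like this - a wrapping CTE is still needed because window functions can't appear in a WHERE clause. Untested sketch reusing a few of the sample rows from above:

code:
```sql
;with src AS (
  -- a subset of the sample rows from above
  SELECT 'Laphroaig' AS Scotch, 5000.0 AS SalesAmt, 'Bar' AS Purpose
  UNION SELECT 'Laphroaig', 3500, 'Lounge'
  UNION SELECT 'Glenlivet', 2000, 'Home'
  UNION SELECT 'Glenlivet', 8000, 'Bar'

), totals AS (
  SELECT DISTINCT
    Scotch,
    SUM(SalesAmt) OVER (PARTITION BY Scotch) AS SumOfSalesAmt,
    SUM(CASE WHEN Purpose IN ('Bar','Lounge') THEN SalesAmt ELSE 0 END)
        OVER (PARTITION BY Scotch) AS SumOfSalesAmtBarOrLounge
  FROM src
)

SELECT
  Scotch,
  SumOfSalesAmt,
  SumOfSalesAmtBarOrLounge
FROM
  totals
WHERE
  SumOfSalesAmtBarOrLounge > SumOfSalesAmt * 0.65
ORDER BY
  Scotch
```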

mortarr
Apr 28, 2005

frozen meat at high speed

Oysters Autobio posted:

This isn't really a coding question, but I'm not sure where to find good resources because of all the enterprise-software clickbait, and I don't know if there's a dataops / data engineering megathread here.

I'm being asked to help provide input for metadata governance / Master Data Management stuff, but the only things I can find online are ads for enterprise software. Any good actual resources for best practices? Checked Reddit, and the only advice I got out of it was to run - which I'm slowly realizing is pretty good advice

I'd be keen on this too if anyone has any info - I've looked into training courses etc but all I've found locally are post-grad papers as part of a masters in records and data management.

mortarr
Apr 28, 2005

frozen meat at high speed

I would blow Dane Cook posted:

Weird question, but in your work as a database administrator have you ever been confronted with an ethical quandary, or been asked to do something with data that you weren't comfortable with or didn't feel was appropriate - and how did you handle it?

Not a DBA either, but 20 or so years ago I was responsible for a medical report that was included in a "discharge summary" - a report sent to a patient's GP after an inpatient stay. It included a summarised set of medical tests: test name, date, value, and normal/expected range. I was responsible for re-writing/modernising the report, basically turning it from raw data into a pivoted set to save pages when faxed (lol). During the re-write I found a logic error that transposed data for a small subset of tests - for all patients since the report was first written, going back multiple years. The resulting error could show both false positives and false negatives, depending on which data was transposed.

I thought it was a huge deal; my boss thought we should just fix it and do nothing. I was just starting out, so not all that confident, but given the scale of the issue I was thinking I might need to go to the media or something, as I wasn't getting any traction on it within my team. I don't recall how it happened now, but I ended up speaking to the medical director of the hospital. He explained that the specific tests likely to be problematic had a shelf life - they were relevant only for around two or three months once administered - and that if any had been abnormal, a text note would have been included in the report and a further round of tests would have been triggered and reported differently. So the risk of having non-obvious wrong data on a patient's permanent record was acceptable.

I still think about the whole thing. I don't know if the outcome was right, but ultimately I didn't know enough about the medico-legal side then and had to trust that any fallout would land on the director at that stage, as his instructions to me were clear. I think I would do things differently if the same scenario happened to me now - in particular I would get a lot more in writing - but also my technical and business/people skills are way better now, so I would be able to investigate and advocate my position more effectively.

I really did come close to going to the media, and I think the only thing that stopped me was recognising that doing so would most likely have tanked my career just as it was starting.

I've come across a few ethically dodgy situations at work since but nothing anywhere near the scope of this one, and none to do with data I've been responsible for. I think going through this was a good learning experience - being very careful with sensitive data is certainly something I factor into my work practice now.

mortarr
Apr 28, 2005

frozen meat at high speed

double nine posted:

I'm struggling with a beginner SQL question...



I threw your whole post into ChatGPT as-is and it came up with a pretty similar answer to hammerite's 2nd one (SQL comments are mine):

code:
SELECT *
FROM purchases_last_12_months
WHERE ProductID = '123456'
AND PurchaseDate >= '20220101'
AND CustomerID NOT IN (
    -- ignore customers that have purchased product B in the last 12 months
    SELECT CustomerID
    FROM purchases_last_12_months
    WHERE ProductID = '987654'
)
UNION
SELECT *
FROM older_purchases
WHERE ProductID = '123456'
AND PurchaseDate >= '20220101'
AND CustomerID NOT IN (
    -- ignore customers that have purchased product B *prior* to the last 12 months
    SELECT CustomerID
    FROM older_purchases
    WHERE ProductID = '987654'
)
Dunno if I'd recommend it for advanced analytic queries, but it seems pretty good for getting a starting point if you're already fairly proficient at SQL. I tried to add my own preferred style, which is building stuff up with CTEs, but that post is getting blocked by SA's DDoS filter or something, sorry... here's my pseudocode for it:

code:
/*
probably not performant on large datasets:

with cte purchases for all cols from (purchases_last_12_months unioned with older_purchases) where ProductID = '123456' AND PurchaseDate >= '20220101'
then with cte customers_to_ignore as CustomerID from (purchases_last_12_months unioned with older_purchases) where ProductID = '987654'
get all data from purchases where CustomerID not in customers_to_ignore 
*/
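In case the filter ever lets it through, that pseudocode maps to roughly the following (untested, same table and column names as the version above):

code:
```sql
WITH purchases AS (
  SELECT * FROM purchases_last_12_months
  UNION ALL
  SELECT * FROM older_purchases

), customers_to_ignore AS (
  -- anyone who has ever bought product B, in either table
  SELECT CustomerID FROM purchases WHERE ProductID = '987654'
)

SELECT *
FROM purchases
WHERE ProductID = '123456'
  AND PurchaseDate >= '20220101'
  AND CustomerID NOT IN (SELECT CustomerID FROM customers_to_ignore)
```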

mortarr fucked around with this message at 22:29 on Feb 16, 2023

mortarr
Apr 28, 2005

frozen meat at high speed
Does the NOT IN vs NOT EXISTS anti-join query-plan forcing apply to MS SQL as well as Postgres? And is it always true that NOT IN is sub-optimal vs NOT EXISTS, or is that just for larger sets?

I've always done NOT IN for a values list and also for tables or ctes with small numbers of values (like 100 or 1000 max maybe?) but swap over to NOT EXISTS for larger sets than that.
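One thing I do know for sure on the MS SQL side: plans aside, NOT IN has a NULL trap that NOT EXISTS doesn't, which alone makes NOT EXISTS the safer habit. Sketch with hypothetical Customers/Orders tables:

code:
```sql
-- With NOT IN, a single NULL in the subquery makes the whole predicate
-- UNKNOWN for every row, so this can silently return nothing:
SELECT c.CustomerID
FROM dbo.Customers c
WHERE c.CustomerID NOT IN (SELECT o.CustomerID FROM dbo.Orders o);

-- NOT EXISTS has no such trap and typically plans as an anti-semi-join:
SELECT c.CustomerID
FROM dbo.Customers c
WHERE NOT EXISTS (SELECT 1 FROM dbo.Orders o WHERE o.CustomerID = c.CustomerID);
```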

Jabor posted:

That chatgpt answer is really cool because it looks plausible and probably works on a couple of test cases but is also completely wrong and doesn't meet the specification at all.

Please do not use chatgpt to answer questions, it does not actually know the answers.

Sorry about that ChatGPT answer - not sure why, looking at it now, but it really did look legit at the time. 20-some years writing SQL doesn't count for much when you're in a rush to post, I guess

mortarr fucked around with this message at 21:49 on Feb 19, 2023

mortarr
Apr 28, 2005

frozen meat at high speed
If I have a table in an MS SQL database with a bunch of rows already, and then I add a column like [ID] [int] IDENTITY(1,1) NOT NULL, what controls the order in which the rows are populated?

Is there some kind of underlying structure that would cause the rows to be populated sequentially in order of when they were inserted, or is it random?

mortarr fucked around with this message at 01:03 on Jul 10, 2023

mortarr
Apr 28, 2005

frozen meat at high speed
Thanks for the IDENTITY-in-MS-SQL chat - it confirms my suspicion that I need to use some other columns if I want to populate it in a specific order.
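For the record, the other-columns approach looks something like this - an untested sketch with invented names, numbering rows by whatever column defines the order you actually want:

code:
```sql
-- rebuild with an explicit, deterministic numbering instead of letting
-- ALTER TABLE ... ADD ... IDENTITY assign values in an undefined order
SELECT
  ROW_NUMBER() OVER (ORDER BY t.CreatedAt) AS ID,
  t.SomeCol,
  t.OtherCol
INTO dbo.MyTable_Numbered
FROM dbo.MyTable AS t;
```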

mortarr
Apr 28, 2005

frozen meat at high speed
A lot of Data Engineering jobs mention Python - where should I start? I've done a lot of data-related dev stuff, almost all in C#/MS SQL, if that helps. Looking at one of the Azure certs mentioned above too.

mortarr
Apr 28, 2005

frozen meat at high speed

GigaPeon posted:

My takeaway so far is that I want Redgate’s toys and gizmos but my boss probably won’t buy them.

This thing is hosted by Redgate though so I’m probably being brainwashed.

Also there’s so much AI poo poo. How are people just cool with hooking up their environments to the internet and sending it to a third party?

Tell me more about the latest Redgate gizmos - I used their SQL Toolbelt in another life, and their dependency-sniffing tool was a legit lifesaver for the end-of-lifed products I was extracting data from at the time, but I never really got into much else of theirs.

mortarr
Apr 28, 2005

frozen meat at high speed
Thanks all - interesting to see the inclusion of ChatGPT, although I guess it was inevitable.

kumba posted:

i have successfully convinced my team to start all of their where clauses with

WHERE 1=1

and sqlprompt is very good at making things like this easier

I thought I was the only person who did the 1=1 thing. So handy.
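For anyone who hasn't seen it: 1=1 lets every real predicate start with AND, so conditions can be commented out or appended without ever touching the first line of the WHERE. Table name invented:

code:
```sql
SELECT *
FROM dbo.Orders
WHERE 1=1
  AND OrderDate >= '20230101'
  -- AND CustomerID = 42   -- easy to toggle on and off
  AND Status = 'Shipped';
```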

mortarr
Apr 28, 2005

frozen meat at high speed

redleader posted:

yeah, when comparing datetimeoffsets, sql server normalises them all into utc before comparing them. in fact, casting [dst].[last_synced_at] to datetimeoffset with AT TIME ZONE 'UTC' might actually not be needed at all. i think that implicitly (or explicitly) casting a datetime to datetimeoffset will set the offset to +00:00, so just converting the EST datetime to datetimeoffset should work. but i'd definitely do some googling and comment the heck out of that if i went that route

I just want to say I have done all this converting from datetime to datetimeoffset and I feel your pain. I was the timezone "expert" for a couple of companies in the past, and I hope I never have to dick about with timezones again - they are by far the most complex and error-prone calculations I've ever had to work with.
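A quick illustration of the UTC-normalising behaviour redleader describes - these two literals are the same instant written with different offsets, and SQL Server treats them as equal:

code:
```sql
SELECT CASE
         WHEN CAST('2023-06-01 12:00:00 +00:00' AS datetimeoffset)
            = CAST('2023-06-01 07:00:00 -05:00' AS datetimeoffset)
         THEN 'equal'
         ELSE 'different'
       END AS comparison;  -- returns 'equal'
```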


mortarr
Apr 28, 2005

frozen meat at high speed

RandomBlue posted:

I loving hate working with dates and times because there's always some new way poo poo gets hosed up. There are a shitload of formats, even when timezones aren't an issue - but they always are. Storing it in UTC is the easiest part.

If you control the input and can limit things there, it's not so bad, but I work on a product that takes user-provided CSVs, spreadsheets, DB sources, CRMs, etc., and of course we work with international customers, so all that poo poo has to work together seamlessly.

I had a fun one where all dates were initially NZST, then for a year or three were a mix of NZST and NZDT with no marker for which was which - so there were gaps when the clocks went forward and double-ups when they went back - and then it went back to all-NZST. The incoming data was from SCADA and IoT, so the raw source data was long gone by the time I found the fault, and there was a lot of it: thousands of devices, some posting new readings at 5-minute intervals.

I ended up adding columns for the raw timezone, a UTC conversion, and a flag to highlight the faulty data - it was only over the daylight-saving crossovers that it was an issue, but there was pretty much nothing we could do so long after the fact.
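The shape of the fix, with an invented schema - T-SQL's AT TIME ZONE applies the NZ daylight-saving rules for the date in question, and the double conversion lands the value in UTC:

code:
```sql
-- add the extra columns (names invented)
ALTER TABLE dbo.Readings ADD
  ReadingUtc datetimeoffset NULL,
  SuspectDstCrossover bit NOT NULL DEFAULT 0;

-- interpret the naive local datetime as NZ time, then convert to UTC
UPDATE dbo.Readings
SET ReadingUtc = (ReadingLocal AT TIME ZONE 'New Zealand Standard Time')
                 AT TIME ZONE 'UTC';
```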
