Thanks Ants
May 21, 2004

#essereFerrari


tehinternet posted:

I think I might have said it before, but if any of you all have companies looking for new revenue cycle software, for the love of god do not even consider Athena IDX.

Incompetent staff that love to hear their own voice
Best practices from 1997
"Private cloud" just means that they host the program on their trash servers that require VPN to access
If I wrote out all of their inane, nonsensical bullshit it would legit sound like I was making it up or embellishing it. They loving suck; avoid at all costs.

I would honest to god rather deal with Citrix bullshit than their late 90's garbage

it's so bad

so bad

A couple years back I did some consulting work for a law firm that was changing its case management software. They spent the whole sales process explaining how they had no on-prem services: everything was Azure AD and Office 365, endpoints managed in Intune, etc. Anyway, the first step of deploying this new software was to create a VPN tunnel to establish domain trust, and that's why they now spend several thousand per year on Azure VMs.


Hughmoris
Apr 21, 2007
Let's go to the abyss!

Agrikk posted:

I need some help.

I have two collections of PDF files: one is 171GB and 11,400 files, the other is 660GB and 50,000 files.

I would like to merge the collections into a single one and eliminate the duplicates.

The issue is that both collections have "reorganized" the contents so the directory tree isn't the same. In addition, each collection has copy/paste problems so there might be duplicates within themselves. To muddy the waters still further, there might be multiples in each collection that are named slightly differently, with slightly different file sizes. So "[publisher][id] - filename.pdf" might be a 7k byte file and "filename.publisher.id" might be 6k but both will have the same content and differ only in scan resolution/OCR/optimization.

What I am imagining is some way to create a spreadsheet of each collection and then use some kind of LIKE function to find and list similarly named files from both collections. Even cooler would be to identify the similar files and group them together so we can then pick the one we want and discard the rest.

I'm thinking maybe use robocopy or some other utility to create a massive file list of each collection and combine them to sort by name, but that doesn't solve the "kinda identical name" problem.

It's been a while since I've done any useful scripting but I'd reach for Python + Fuzzy Matching. Post this over in our Python thread and you'll likely get some good ideas to jumpstart you.
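
For a sense of the idea, here is a shell-only sketch of a first pass at the "kinda identical name" problem: normalize each filename and join the two lists on the normalized key. The list-file and output names here are made up, and this only catches case and punctuation variants; real fuzzy scoring (like Python's difflib) is what handles reordered tokens such as "filename.publisher.id" vs "[publisher][id] - filename".

code:
#!/bin/bash
# crude "fuzzy" name matching: strip each name down to lowercase letters and
# digits, then join the two lists on that key; near-misses still need eyeballs

normalize() {
    while read -r p; do
        # basename without extension, lowercased, everything but a-z0-9 stripped
        n=$(basename "$p" .pdf | tr '[:upper:]' '[:lower:]' | tr -cd 'a-z0-9')
        printf '%s\t%s\n' "$n" "$p"
    done
}

normalize < list_of_a.output | sort -k1,1 > norm_a.tsv
normalize < list_of_b.output | sort -k1,1 > norm_b.tsv

# pairs whose normalized names match exactly
join -t $'\t' -j 1 norm_a.tsv norm_b.tsv > candidate_pairs.tsv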

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

Hughmoris posted:

It's been a while since I've done any useful scripting but I'd reach for Python + Fuzzy Matching. Post this over in our Python thread and you'll likely get some good ideas to jumpstart you.

Ooh! Ooh! Good idea!

The Fool
Oct 16, 2003


PowerShell has -like and -contains comparison operators that could help with that.

Super-NintendoUser
Jan 16, 2004

COWABUNGERDER COMPADRES
Soiled Meat

Agrikk posted:

I need some help.

I have two collections of PDF files: one is 171GB and 11,400 files, the other is 660GB and 50,000 files.

I would like to merge the collections into a single one and eliminate the duplicates.

The issue is that both collections have "reorganized" the contents so the directory tree isn't the same. In addition, each collection has copy/paste problems so there might be duplicates within themselves. To muddy the waters still further, there might be multiples in each collection that are named slightly differently, with slightly different file sizes. So "[publisher][id] - filename.pdf" might be a 7k byte file and "filename.publisher.id" might be 6k but both will have the same content and differ only in scan resolution/OCR/optimization.

What I am imagining is some way to create a spreadsheet of each collection and then use some kind of LIKE function to find and list similarly named files from both collections. Even cooler would be to identify the similar files and group them together so we can then pick the one we want and discard the rest.

I'm thinking maybe use robocopy or some other utility to create a massive file list of each collection and combine them to sort by name, but that doesn't solve the "kinda identical name" problem.

Just to make sure I'm clear, there's duplicates between bundle A and bundle B, but not between two documents in A or B themselves?

I'd start with the easiest set of matches, get all you can sorted into group C, and remove them from groups A and B. Then see how many you have left, find another matching rule that catches a good set, and keep whittling it down until you have a couple hundred of the hardest, which you do manually. I'm going to brute force it a bit with some inefficient sorts, but I don't think this will really take that long. First, make a backup of both collections, since this will be destructive.

Put them on a Linux server. Then make sure there are no duplicates inside each collection. We can do that with "fdupes":
https://github.com/adrianlopezroche/fdupes
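
For reference, typical invocations look like this (flags as documented in the fdupes manpage; double-check your version before running anything with -d):

code:
fdupes -r /path/a      # list duplicate sets inside collection A
fdupes -rS /path/a     # same, but show file sizes
fdupes -rdN /path/a    # keep the first file in each set, delete the rest, no prompts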

Once that's done, we can do some really bad code to programmatically compare the two directories. We want to do the following:
1) if a file exists in A, but not in B, leave it
2) if a file exists in B but not in A, leave it
3) if a file exists in A and B, assume the copy in A is authoritative, then delete the instance in directory B.


Step 1) Make a complete list of all the files in A and B and compute md5sums

code:
cd /path/a
find . -type f -print > list_of_a.output
while read -r line; do md5sum "${line}"; done < list_of_a.output > md5sum_a.output
cd /path/b
find . -type f -print > list_of_b.output
while read -r line; do md5sum "${line}"; done < list_of_b.output > md5sum_b.output
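Those read loops work, but the same lists can be built in one pass each with find's -exec, assuming no filenames contain newlines:

code:
cd /path/a && find . -type f -exec md5sum {} + > md5sum_a.output
cd /path/b && find . -type f -exec md5sum {} + > md5sum_b.output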
Step 2) Process all matches
Now you have two files, each with a complete list of paths and their md5sums. If any of the sums match, the files are identical, so we need a goofy way to find the matches. Get ready for some bad code. Run it with "./script.sh" for a dry run; run it as "./script.sh delete" and it will attempt to delete (I pipe 'n' into rm -i so it won't actually delete anything until you take the training wheels off). Basically it takes the lists above, looks at each line in list A, and compares the hash against each hash in list B. There are two outcomes:
1) a file's hash appears in both A and B -> the copy in A is treated as authoritative and the copy in B is flagged for deletion
2) a file in A has no match in B -> nothing is touched

When it's done, you have two things:
1) location A left alone
2) location B purged of files that match location A

code:
#!/bin/bash

readonly list_a="md5sum_a.output"
readonly list_b="md5sum_b.output"
readonly function=$1

delete_from_b() {
    # this will remove the file from B, since it's already present in A
    # janky on purpose: "yes n" answers no to every rm -i prompt, so nothing
    # is actually deleted until you take the training wheels off
    # (${path_b:3} strips the leading " ./" left over from the awk extraction)
    if [[ "${function}" == "delete" ]]; then
        yes n | rm -i "${path_b:3}"
    else
        ls -l "${path_b:3}"
    fi
}

main() {
    # read each line of list A, pull out its hash, and compare it to every sum in list B
    num_a="0"
    while read -r line_a; do
        # increment by one to give some idea of progress
        let "num_a++"
        # reset the inner counter for each file in A
        num_b="0"
        # first field is the hash
        hash_a=$(echo -n "${line_a}" | awk '{print $1}')
        # the rest of the line is the full path
        path_a=$(echo -n "${line_a}" | awk '{$1=""}1')
        echo "Processing file $num_a path=${path_a} hash=${hash_a}"
        # really bad comparison attempt, should be fun
        while read -r line_b; do
            let "num_b++"
            hash_b=$(echo -n "${line_b}" | awk '{print $1}')
            path_b=$(echo -n "${line_b}" | awk '{$1=""}1')
            echo "Comparing to file $num_b path=${path_b} hash=${hash_b}"
            if [[ "${hash_a}" == "${hash_b}" ]]; then
                echo "files match"
                delete_from_b
                # fdupes already removed duplicates inside each collection, so a
                # hash appears at most once in B and we can stop scanning early
                break
            else
                echo "files don't match"
            fi
        done < "${list_b}"
    done < "${list_a}"
}

main
At this point, figure out another matching criterion and re-compare. OCR'd PDFs will be super hard to compare directly. Probably the best option is to scrape the text (or maybe just the first hundred characters), write it to a file named _CONTENTS_actual_file.pdf, and then do another pass that compares the contents of each of the _CONTENTS_ files and uses that as a baseline. Then if the first hundred characters match, a second loop digs deeper and looks at the first 1000 characters; if those match, it compares all the text.
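
A sketch of that scraping pass, assuming poppler-utils' pdftotext is installed; the _CONTENTS_ naming follows the idea above, everything else is illustrative:

code:
#!/bin/bash
# write a 100-character "content fingerprint" next to each PDF; a later
# pass can then compare fingerprints instead of the PDFs themselves
find . -type f -name '*.pdf' | while read -r f; do
    pdftotext "$f" - 2>/dev/null | head -c 100 > "$(dirname "$f")/_CONTENTS_$(basename "$f").txt"
done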

Anything deeper with fuzzy logic (like OCR turning "lamp" into "Iamp" or "1amp") and you'll need to index it with Elasticsearch or something and then parse it that way.

Super-NintendoUser fucked around with this message at 21:56 on Sep 7, 2023

Super-NintendoUser
Jan 16, 2004

COWABUNGERDER COMPADRES
Soiled Meat

Hughmoris posted:

It's been a while since I've done any useful scripting but I'd reach for Python + Fuzzy Matching. Post this over in our Python thread and you'll likely get some good ideas to jumpstart you.

This is the right answer. My code is hot garbage, but I'd make a first quick pass with checksums since that's super quick.

1) compare A internally with fdupes
2) compare B internally with fdupes
3) hash compare A to B with my terrible scripting
4) hash compare B to A with my terrible scripting (not necessary: after step 3 you already know B is unique against A)

Then you know for sure all the easy ones are gone. You may find you don't have as many edge cases as you think.

I re-read your post: you want to apply a lot more intelligence to this. You still can with the checksums, using them to make reports and then going over them manually, but I'd suggest just blasting away the exact matches up front and starting from there.

Super-NintendoUser fucked around with this message at 22:51 on Sep 7, 2023

jaegerx
Sep 10, 2012

Maybe this post will get me on your ignore list!


Ok. Which one of you is this

Nsfw https://hw-videos.worldstarhiphop.com/u/vid/2023/09/qb1uzotFFPRy.mp4

No nudity but definitely a computer toucher


e: I didn't know it'd show that big, I was on awful app.

jaegerx fucked around with this message at 01:14 on Sep 8, 2023

CitizenKain
May 27, 2001

That was Gary Cooper, asshole.

Nap Ghost

jaegerx posted:

Ok. Which one of you is this

Nsfw https://hw-videos.worldstarhiphop.com/u/vid/2023/09/qb1uzotFFPRy.mp4

No nudity but definitely a computer toucher.

No

Hughmoris
Apr 21, 2007
Let's go to the abyss!

Super-NintendoUser posted:

Just to make sure I'm clear, there's duplicates between bundle A and bundle B, but not between two documents in A or B themselves?

I'd start with the easiest set of matches ...

Good stuff...

I really should take time to better learn shell scripting.

Jiro
Jan 13, 2004

jaegerx posted:

Ok. Which one of you is this

Nsfw https://hw-videos.worldstarhiphop.com/u/vid/2023/09/qb1uzotFFPRy.mp4

No nudity but definitely a computer toucher.

:barf:

Reminds me of how bad a friend's room would get......

Super-NintendoUser
Jan 16, 2004

COWABUNGERDER COMPADRES
Soiled Meat

Hughmoris posted:

I really should take time to better learn shell scripting.

Start working in a Linux or macOS terminal. Try doing simple things via bash even if doing them manually is faster, just to get some exposure. Scripting is just complex combinations of simple tasks, so learn the simple tasks and it'll start fitting together.

If you are a novice at scripting, I'd highly suggest learning Python. Bash is nice, but Python is way more powerful, flexible, and portable.

The Fool
Oct 16, 2003


powershell is a thing you know

jaegerx
Sep 10, 2012

Maybe this post will get me on your ignore list!


The Fool posted:

powershell is a thing you know

^ while not built for Linux, you can hack that poo poo to work on Linux.

FISHMANPET
Mar 3, 2007

Sweet 'N Sour
Can't
Melt
Steel Beams
And by "hack" it just means install the shell from PowerShell core releases or it might even be in your system's package manager.

i am a moron
Nov 12, 2020

"I think if there’s one thing we can all agree on it’s that Penn State and Michigan both suck and are garbage and it’s hilarious Michigan fans are freaking out thinking this is their natty window when they can’t even beat a B12 team in the playoffs lmao"
Or u can just use something that isn’t poo poo….

FISHMANPET
Mar 3, 2007

Sweet 'N Sour
Can't
Melt
Steel Beams
PowerShell is, I think, a much friendlier shell than Bash and the like. It's also a scripting language, though not as powerful as Python; it straddles the line between shell and scripting. I do think the pipeline, and everything being an object in PowerShell versus everything being text in traditional Unix shells, makes it a lot more powerful in a lot of cases.

i am a moron
Nov 12, 2020

"I think if there’s one thing we can all agree on it’s that Penn State and Michigan both suck and are garbage and it’s hilarious Michigan fans are freaking out thinking this is their natty window when they can’t even beat a B12 team in the playoffs lmao"
I was kidding I’ve used it extensively. Prefer Az CLI now but frankly I do almost 0 hands on work and that’s only applicable to azure stuff so my opinion is useless

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
Turned on APM for our biggest set of services today, which went great until our elastic cluster fell over. The APM data was only 22GB/hour, so not too bad, and certainly not enough to kill the cluster…

After about 3 hours of digging I realized I forgot to turn off verbose logging to stdout on the opentelemetry agent and collectors, which collectively produced a cool 220GB/hour of redundant log data, including every trace before sampling. Well done me.

At least this validated my design of offloading logs from service -> otel agent immediately, which meant applications don’t fail when telemetry does! And it’s kinda neat that the verbose logs were an exact order of magnitude greater in size than the traces and metrics!

i am a moron
Nov 12, 2020

"I think if there’s one thing we can all agree on it’s that Penn State and Michigan both suck and are garbage and it’s hilarious Michigan fans are freaking out thinking this is their natty window when they can’t even beat a B12 team in the playoffs lmao"
I’m writing a DR strategy doc for some fuckoff huge company and I’m realizing I hate every part of my job. I used to think kicking rear end on big projects was fun but all the joy has been sucked out of it for me. Why am I explaining the same thing to the same dead eyed middle aged chubsters at every company. Like loving Google this poo poo it’s literally all on their website

I just don’t care anymore. I need to move to sales and just spend my days golfing and drinking and running up my expenses

i am a moron
Nov 12, 2020

"I think if there’s one thing we can all agree on it’s that Penn State and Michigan both suck and are garbage and it’s hilarious Michigan fans are freaking out thinking this is their natty window when they can’t even beat a B12 team in the playoffs lmao"
It’s always the same questions too. ‘Buh what about my block level blah blah’ I DONT loving KNOW WHAT THAT MEANS. gently caress YOUR DEDUPLICATION. I’ve been doing this poo poo for virtually ever in IT years, mention SRDF to me again and I’ll jam a power edge so far up your rear end your nostrils will light up every time you fart

tokin opposition
Apr 8, 2021

I don't jailbreak the androids, I set them free.

WATCH MARS EXPRESS (2023)
If I had a quarter for every time I had to patiently reexplain something trivial to my boss or co-worker I'd be retired by now

If we include previous jobs I'd be living on the moon in a house made of coins

i am a moron
Nov 12, 2020

"I think if there’s one thing we can all agree on it’s that Penn State and Michigan both suck and are garbage and it’s hilarious Michigan fans are freaking out thinking this is their natty window when they can’t even beat a B12 team in the playoffs lmao"
For better or worse consulting sometimes is explaining the same thing over and over to people. The worst orgs have the exact same tendencies and running projects with them is like shooting fish in a barrel. My favorite client ever hated consultants and I went from a solo six week project to a 2 1/2 year reputation defining deal there. They challenged me and I challenged them. Most of our clients are mouth breathing idiots and that’s why they had to ask us wtf to do in the first place. It’s gotten really hollow at this point

Internet Explorer
Jun 1, 2005





i am a moron posted:

I used to think kicking rear end on big projects was fun but all the joy has been sucked out of it for me.

Same. Although you can go back 5 years in this thread and see me saying the same poo poo. I have definitely reached the point where this work rarely interests me anymore. But at the same time it also doesn't define who I am anymore either, which is a positive change.

I'm now working with absolutely giant corporations after doing mostly SMB throughout my career. Maybe a few thousand employees at the top end. And as a surprise to no one, it's just as bad if not worse.

I try not to talk about what I'm doing anymore because I don't want the bullshit that comes with it. But it's just all so dumb. Weeks of people mad... have you tried scaling out your Redis cluster? Problem solved. Weeks of people mad. You switched from one database engine to another and did no work on your queries or database architecture? No load testing before going to prod? And it needs to be fixed tomorrow? Good luck.

DR/BC sucks because no one takes it seriously but they all want to pretend they do. Look, let's all check the box and move on with our lives. Stop lying to me.

i am a moron
Nov 12, 2020

"I think if there’s one thing we can all agree on it’s that Penn State and Michigan both suck and are garbage and it’s hilarious Michigan fans are freaking out thinking this is their natty window when they can’t even beat a B12 team in the playoffs lmao"

Internet Explorer posted:


DR/BC sucks because no one takes it seriously but they all want to pretend they do. Look, let's all check the box and move on with our lives. Stop lying to me.

Reminds me of a question I’ve heard three times this year alone:

‘Can we keep the same IPs so we don’t have to mess with DNS?’

Uhh I mean technically you could but just don’t bother failing over at that point. You won’t get any of this poo poo up before the region/service comes up

And if that was your plan on prem then lmao

Vampire Panties
Apr 18, 2001
nposter
Nap Ghost

i am a moron posted:

I just don’t care anymore. I need to move to sales and just spend my days golfing and drinking and running up my expenses

:haibrow: I'm staring :airquote: retirement :airquote: in the face. Simply put, there is no amount of money that will get me to do SE work again.

So I either go back into the field, or go into sales proper. Wrench-turning work is cathartic, and the pay can be pretty good, but companies worship salespeople.


i am a moron posted:

For better or worse consulting sometimes is explaining the same thing over and over to people. The worst orgs have the exact same tendencies and running projects with them is like shooting fish in a barrel. My favorite client ever hated consultants and I went from a solo six week project to a 2 1/2 year reputation defining deal there. They challenged me and I challenged them. Most of our clients are mouth breathing idiots and that’s why they had to ask us wtf to do in the first place. It’s gotten really hollow at this point

:same: this is the part that kills me about SE work - all of our customers are undergoing the same challenges, because they're all using the same poo poo, because WE loving sold it to them. Even top-flight AMs and Sales VPs are like "ok, what are we gonna do to take down this big sale (that is literally loving identical to the last big sale we took down)"
"... do the same thing?"
thrown-out-conference-window.meme.jpg

Vampire Panties fucked around with this message at 05:21 on Sep 8, 2023

tokin opposition
Apr 8, 2021

I don't jailbreak the androids, I set them free.

WATCH MARS EXPRESS (2023)
every sales person has made my skin crawl, I don't know what kind of person willingly goes into it but I do know I try not to look them in the eye in case that's how they spread

Internet Explorer
Jun 1, 2005





There's no way, I couldn't do it.

ihafarm
Aug 12, 2004

Agrikk posted:

I need some help.

I have two collections of PDF files: one is 171GB and 11,400 files, the other is 660GB and 50,000 files.

I would like to merge the collections into a single one and eliminate the duplicates.

The issue is that both collections have "reorganized" the contents so the directory tree isn't the same. In addition, each collection has copy/paste problems so there might be duplicates within themselves. To muddy the waters still further, there might be multiples in each collection that are named slightly differently, with slightly different file sizes. So "[publisher][id] - filename.pdf" might be a 7k byte file and "filename.publisher.id" might be 6k but both will have the same content and differ only in scan resolution/OCR/optimization.

What I am imagining is some way to create a spreadsheet of each collection and then use some kind of LIKE function to find and list similarly named files from both collections. Even cooler would be to identify the similar files and group them together so we can then pick the one we want and discard the rest.

I'm thinking maybe use robocopy or some other utility to create a massive file list of each collection and combine them to sort by name, but that doesn't solve the "kinda identical name" problem.

BeyondCompare https://www.scootersoftware.com

The trial is fully functional; the 30-day countdown only counts days when the application is launched, not consecutive calendar days.

Super-NintendoUser
Jan 16, 2004

COWABUNGERDER COMPADRES
Soiled Meat
Beyond Compare is basically magic, but for that task (files in weird mismatched folder trees) I don't think it will work the way he needs.

Organic Lube User
Apr 15, 2005

tokin opposition posted:

every sales person has made my skin crawl, I don't know what kind of person willingly goes into it but I do know I try not to look them in the eye in case that's how they spread

It takes an ability to relish lying to people. It's a skill you only have because of a moral failure.

BIG FLUFFY DOG
Feb 16, 2011

On the internet, nobody knows you're a dog.


We've been having consistent issues with two particular salespeople who keep trying to circumvent the policy for making changes to a customer's setup by just calling until they get someone who will do it.

tehinternet
Feb 14, 2005

Semantically, "you" is both singular and plural, though syntactically it is always plural. It always takes a verb form that originally marked the word as plural.

Also, there is no plural when the context is an argument with an individual rather than a group. Somfin shouldn't put words in my mouth.

Organic Lube User posted:

It takes an ability to relish lying to people. It's a skill you only have because of a moral failure.

Yeah, sales people are the worst and nearly uniformly the dumbest fucks.

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.
From J2 yesterday:

VP of IT: what’s the status on the network segmentation project?

Director of IT: Agrikk, what’s the status of the network segmentation project?

Agrikk: network segmentation project…?


That meeting did not go well for my boss, so we synced up afterward to talk about this project:

Project plan? Nope
List of workloads to move to new network? Nope
Diagram of existing network? Nope
Diagram of future state? Nope
Requirements document? Nope

loving problem statement detailing why this has to happen? Of course not.

But that’s not stopping networking from vomiting workloads Willy-nilly all over their new switches.


Never mind architecture. Apparently I’ve been hired to be the company’s documentarian.


But you pay me senior architect money so idgaf. All I know is my 30-year mortgage with 27 years remaining is getting paid off in 24 months.


Also: thanks for the tips on my file compare thing. I'll give your suggestions a whirl and see what's what.

Agrikk fucked around with this message at 16:34 on Sep 8, 2023

Handsome Ralph
Sep 3, 2004

Oh boy, posting!
That's where I'm a Viking!


tehinternet posted:

Yeah, sales people are the worst and nearly uniformly the dumbest fucks.

See also, doctors.

Had one pester me, insisting that our servers or VPN were hosed and that it couldn't possibly be his home internet. He kept hanging up on me and my coworker until another coworker held his hand, made him run a ping command, and got him to understand that yes, his home internet was poo poo and we had no way of fixing that.

Inner Light
Jan 2, 2020



Agrikk posted:

But you pay me senior architect money so idgaf. All I know is my 30-year mortgage with 27 years remaining is getting paid off in 24 months.

I’ve been having a pretty meh year and my gig is definitely not swinging senior architect comp. You all hiring?

skipdogg
Nov 29, 2004
Resident SRT-4 Expert

Inner Light posted:

I’ve been having a pretty meh year and my gig is definitely not swinging senior architect comp. You all hiring?

IIRC that's his 2nd job too

Wibla
Feb 16, 2011

Agrikk posted:

Project plan? Nope
List of workloads to move to new network? Nope
Diagram of existing network? Nope
Diagram of future state? Nope
Requirements document? Nope

loving problem statement detailing why this has to happen? Of course not.

But that’s not stopping networking from vomiting workloads Willy-nilly all over their new switches.

:laffo:

We're (a little bit) behind on our new network buildout, because I'm an eternal time optimist, but reading this I don't feel bad about it anymore :haw:
(We have all of those things)

Agrikk posted:

But you pay me senior architect money so idgaf. All I know is my 30-year mortgage with 27 years remaining is getting paid off in 24 months.

Hell yeah, that's the way to go :black101:

Thanks Ants
May 21, 2004

#essereFerrari


Agrikk posted:

loving problem statement detailing why this has to happen? Of course not.

All anybody needs to bring me to get me to love them forever is a summary of the step-zero part of the project, which is "what is the problem we are fixing?" So often I've seen this omitted, and the most top-level discussion is "here's what we are implementing", and it never ends well.

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

skipdogg posted:

IIRC that's his 2nd job too

That's correct. After doing IT for over thirty years, the hardest part of the overemployed thing is finding two compatible jobs, especially around non-compete clauses and multiple-employer provisions, and carefully managing my calendars.

Calendar collisions can be tricky but I have absolutely come to realize how many meetings I have in which I don’t contribute anything at all.

And company loyalty? gently caress that. I am grabbing all the cash I can as hard as I can.


xzzy
Mar 5, 2009

I was ranting about scummy sales people and how they're all garbage people and one guy pipes up "uhh I used to do sales."

Kinda awkward but I stood by my claim.
