 
  • Locked thread
qntm
Jun 17, 2009

Bloody posted:

idk why you'd ever use git from a command line

every Git GUI I've ever used eventually gives up and tells you to run certain commands manually, or just gives up entirely


pokeyman
Nov 26, 2006

That elephant ate my entire platoon.

HoboMan posted:

git Suits My Needs and is super good as long as you remember what the command is. i have to google how to remove a tag every time

i once found a tag called "rm". still makes me chuckle sadly

jony neuemonic
Nov 13, 2009

Power Ambient posted:

for me its because every gui has loving sucked also i have a sweet mechanical kb so typing is very good

same and same.

somehow i've become the resident git expert at work. i think it's because i know how to rebase.

Bloody
Mar 3, 2013

qntm posted:

every Git GUI I've ever used eventually gives up and tells you to run certain commands manually, or just gives up entirely

stop breaking poo poo so badly then i guess

Luigi Thirty
Apr 30, 2006

Emergency confection port.

I use sourcetree and it is needs-suiting

DONT THREAD ON ME
Oct 1, 2002

by Nyc_Tattoo
Floss Finder

GameCube posted:

lol this might be it. God dammit

lol plz keep us updated

i assumed that your thing was different because my thing happened in ruby which is prone to errors like that

DONT THREAD ON ME
Oct 1, 2002

by Nyc_Tattoo
Floss Finder
also the go 'debugger' repl is so lol

Sapozhnik
Jan 2, 2005

Nap Ghost

GameCube posted:

lol this might be it. God dammit

lol this is shameful

wouldn't it be great if http actually already had a dedicated status code for a uri that's too long? no, surely the protocol's designers would never think of doing that.

HoboMan
Nov 4, 2010

wait, this is a thing? how long is "too long"? this might gently caress me down the road

netcat
Apr 29, 2008

jony neuemonic posted:

same and same.

somehow i've become the resident git expert at work. i think it's because i know how to rebase.

lol same. I also know about the awesome power of "git reflog" so I can magically restore everyone's broken branches when they inevitably gently caress up a rebase.

abraham linksys
Sep 6, 2010

:darksouls:
afaik there's no hard-defined url length limit in clients or servers, or all those sites that work by base64ing user-generated content in a query parameter wouldn't work? it's just something you have to configure on your server end (nginx: http://nginx.org/en/docs/http/ngx_http_core_module.html#large_client_header_buffers, gunicorn: http://docs.gunicorn.org/en/latest/settings.html?highlight=limit_request_line#limit-request-line)

MononcQc
May 29, 2007

Mr Dog posted:

lol this is shameful

wouldn't it be great if http actually already had a dedicated status code for a uri that's too long? no, surely the protocol's designers would never think of doing that.

status 414 only helps if the server complains; it doesn't help if (as probably happened here) the client silently truncates the URL before sending it, in which case the server correctly returns a 404 even though it could have supported the longer URL.

brap
Aug 23, 2004

Grimey Drawer
sourcetree is good for git

if im doing anything besides git add, git commit, git push, i usually do it in sourcetree

Zemyla posted:

Why is using snapshots instead of changesets a good idea?

it's a lot faster to do blames and poo poo

brap fucked around with this message at 17:11 on Jul 13, 2016

vodkat
Jun 30, 2012



cannot legally be sold as vodka
Hey I'm trying to work with matching names between two quite large databases and I was wondering if anyone here had some tips.

Firstly, are there any packages for python that will make this sort of thing more painless? for removing all of the edge case prefixes and postfixes that people love to enter for no reason. And secondly, what's the best way to handle slight differences in names between the databases, for example inconsistent use of middle names, last names/first names etc? I've seen some stack exchange answers suggesting fuzzy matching them but I'm not sure what the best way to implement this is.

It seems like this would be the sort of thing that people must run into all the time, but as a p. lovely programmer I'm not really sure what I should be doing.

Bloody
Mar 3, 2013

levenshtein distance can be a decent metric for fuzzy string matching
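For reference, here's a minimal pure-Python sketch of the metric Bloody means: the number of single-character insertions, deletions, and substitutions needed to turn one string into the other. This is illustrative, not any particular library's implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b via the classic two-row DP."""
    # prev[j] holds the distance from the current prefix of a to b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]
```

So `levenshtein("kitten", "sitting")` is 3, and `"Mr. Bob Smith"` vs `"Bob Smith"` is 4 (the four prefix characters), which matters for the threshold discussion later in the thread.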

Shaman Linavi
Apr 3, 2012

i used fuzzywuzzy for one little project and it seemed to work ok

Deep Dish Fuckfest
Sep 6, 2006

Advanced
Computer Touching


Toilet Rascal

vodkat posted:

Hey I'm trying to work with matching names between two quite large databases and I was wondering if anyone here had some tips.

Firstly, are there any packages for python that will make this sort of thing more painless? for removing all of the edge case prefixes and postfixes that people love to enter for no reason. And secondly, what's the best way to handle slight differences in names between the databases, for example inconsistent use of middle names, last names/first names etc? I've seen some stack exchange answers suggesting fuzzy matching them but I'm not sure what the best way to implement this is.

It seems like this would be the sort of thing that people must run into all the time, but as a p. lovely programmer I'm not really sure what I should be doing.

there really isn't a one size fits all solution

cleaning and homogenizing data from different sources is always a huge pain. the "best" solution usually depends on what kind of errors are ok for whatever you're doing. in some cases false matches have to be avoided at all costs, so you just keep perfect matches and discard everything else. in other cases you know that one db has a tendency to have some prefixes or suffixes on names, so you just build up a list of the most common ones, do a first filter pass to remove those, and then do a perfect match between the result and the other db. or if false matches are ok with you, then yeah, doing some fuzzy matching between the dbs and not giving much of a gently caress about it beyond that can work

there's also the issue of what "quite large database" means, because dealing with a few gigabytes versus something in the hundreds of terabytes range requires different approaches. if you're dealing with names though i'm guessing it's probably the former
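The strip-prefixes-then-exact-match pass described above can be sketched like this. The prefix list, field layout, and function names are invented for illustration; a real pass would build the prefix list from what's actually in the data.

```python
# Hypothetical prefix list -- build yours from the most common junk in your db.
PREFIXES = ("mr. ", "mrs. ", "ms. ", "dr. ", "prof. ")

def normalize(name: str) -> str:
    """Lowercase, collapse whitespace, and strip one known prefix."""
    n = " ".join(name.lower().split())
    for p in PREFIXES:
        if n.startswith(p):
            n = n[len(p):]
            break
    return n

def exact_matches(db_a, db_b):
    """Map each name in db_a to a db_b name whose normalized form matches."""
    index = {normalize(n): n for n in db_b}
    return {a: index[normalize(a)] for a in db_a if normalize(a) in index}
```

Anything this pass doesn't catch is what you'd then feed to the fuzzy matcher, so the fuzzy step only has to deal with genuine spelling differences.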

fritz
Jul 26, 2003

qntm posted:


I like that creating a branch is git checkout -b not e.g. git create branch

git branch branchname
?

fritz
Jul 26, 2003

my stepdads beer posted:

i have to google how to unstage every time, or view staged diffs

"git status" tells you how to unstage:

code:
Your branch is up-to-date with '...'.
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)
....
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
....
Untracked files:
  (use "git add <file>..." to include in what will be committed)

qntm
Jun 17, 2009

fritz posted:

git branch branchname
?

okay so now the question is why there are two commands which do the same thing and every tutorial recommends the stupidly-named one

vodkat
Jun 30, 2012



cannot legally be sold as vodka

YeOldeButchere posted:

there really isn't a one size fits all solution

cleaning and homogenizing data from different sources is always a huge pain. the "best" solution usually depends on what kind of errors are ok for whatever you're doing. in some cases false matches have to be avoided at all costs, so you just keep perfect matches and discard everything else. in other cases you know that one db has a tendency to have some prefixes or suffixes on names, so you just build up a list of the most common ones, do a first filter pass to remove those, and then do a perfect match between the result and the other db. or if false matches are ok with you, then yeah, doing some fuzzy matching between the dbs and not giving much of a gently caress about it beyond that can work

there's also the issue of what "quite large database" means, because dealing with a few gigabytes versus something in the hundreds of terabytes range requires different approaches. if you're dealing with names though i'm guessing it's probably the former

The database is just short of a gig which I guess is pretty small fry for most of the people here but as an academic and very lovely programmer it's starting to test my knowledge and abilities quite a bit.

Having looked at fuzzywuzzy it seems like that might be what I need to use but how do you define when a match is good enough? is it a matter of simply plugging in a number and seeing what the result is or is there a better way than trial and error testing?

NihilCredo
Jun 6, 2011

iram omni possibili modo preme:
plus una illa te diffamabit, quam multæ virtutes commendabunt

why is HEAD always written in all caps, and please tell me it's case sensitive because that would be the most unixy thing ever

Wheany
Mar 17, 2006

Spinyahahahahahahahahahahahaha!

Doctor Rope

qntm posted:

okay so now the question is why there are two commands which do the same thing and every tutorial recommends the stupidly-named one

because gently caress you, that's why

NihilCredo posted:

why is HEAD always written in all caps, and please tell me it's case sensitive because that would be the most unixy thing ever

because gently caress you, that's why

Shaman Linavi
Apr 3, 2012

vodkat posted:

Having looked at fuzzywuzzy it seems like that might be what I need use but how do you define when a match is good enough? is it a matter of simply plugging in number and seeing what the result is or is there a better way than trial and error testing?

that is exactly how i used it to check if user input is in a list.
if the input isnt in the list i have fuzzywuzzy check the input against the list and rip out strings over a certain value.
i just fudged around with the value until common spelling errors were giving back what i thought they should from the list.
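Shaman Linavi's pattern, sketched with the stdlib's difflib as a stand-in for fuzzywuzzy (same idea: score candidates against the list and keep anything over a fudged cutoff). The function name and cutoff value are made up for illustration.

```python
import difflib

def closest(user_input: str, known: list[str], cutoff: float = 0.8):
    """Return the best known match for user_input, or None below the cutoff."""
    if user_input in known:  # exact hit, no fuzzy work needed
        return user_input
    hits = difflib.get_close_matches(user_input, known, n=1, cutoff=cutoff)
    return hits[0] if hits else None
```

With fuzzywuzzy itself the equivalent would be something like `process.extractOne(user_input, known, score_cutoff=...)`, which similarly returns nothing when no candidate clears the cutoff; either way the cutoff gets tuned by fudging, exactly as described above.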

VikingofRock
Aug 24, 2008




qntm posted:

okay so now the question is why there are two commands which do the same thing and every tutorial recommends the stupidly-named one

Those commands don't actually do the same thing. git branch foo creates a branch called foo but doesn't switch to it, whereas git checkout -b foo creates a branch called foo and switches to that branch.

qntm
Jun 17, 2009

VikingofRock posted:

Those commands don't actually do the same thing. git branch foo creates a branch called foo but doesn't switch to it, whereas git checkout -b foo creates a branch called foo and switches to that branch.

then there should be a variant of git branch which also switches to the newly-created branch, putting it on git checkout makes no sense

VikingofRock
Aug 24, 2008




Presumably the tutorials use git checkout -b foo because it's one less command you have to type, and git made the combo command "git checkout -b" instead of "git branch -c" because they thought the checkout was the more important half of the operation. Or maybe they just want checkout to do literally everything.

disclaimer: git branch -c or the like might be a thing, but I'm on my phone so I can't check

The MUMPSorceress
Jan 6, 2012


^SHTPSTS

Gary’s Answer
all of this git talk is really confusing to me because i think we have our own tooling on top of whatever svn already does. when i make a "branch", i get a branch on the server and then all of that is checked out into a folder named after the branch on my computer and that's where i do my work. then i commit that to the branch on the server. when it's ready to go to trunk, our internal tool merges my branch into a local copy of trunk on my computer. then i commit that to trunk on the server.

how does that translate to gitspeak?

HoboMan
Nov 4, 2010

my git workflow
code:
# git status
# git commit -a -m "poo poo is probably less broken now"
# git status
# git tag v69(r219)
# git log
# git push --tags

Deep Dish Fuckfest
Sep 6, 2006

Advanced
Computer Touching


Toilet Rascal

vodkat posted:

The database is just short of a gig which I guess is pretty small fry for most of the people here but as an academic and very lovely programmer it's starting to test my knowledge and abilities quite a bit.

Having looked at fuzzywuzzy it seems like that might be what I need to use but how do you define when a match is good enough? is it a matter of simply plugging in a number and seeing what the result is or is there a better way than trial and error testing?

since it fits in memory then you can do more or less whatever you want with it, so that's good

if you do go with fuzzy matching stuff, then yeah, there's no way around the fact that you'll have to define some arbitrary threshold as to what constitutes a match and what doesn't. most of the time it gets chosen through a very technical empirical process called "loving around with it until it looks good enough". i mean, doing that usually means writing a bit of code to figure out basic stats like how many matches a given threshold gives you or how many rows in one db match to more than one row in the other (which will skyrocket if your threshold is too lenient), but that sort of stuff is exactly why cleaning up data always sucks

you should try to do as much processing to make things homogeneous before you try the fuzzy matching, though. the stuff like removing common prefixes or suffixes should be easy enough to do if you have a lot of that, and it will help with the fuzzy matching afterwards. for example if you're using edit distance (the levenshtein distance bloody mentioned earlier), then "Mr. Bob Smith" with "Mr. " removed would match right away with "Bob Smith" instead of requiring a threshold of 4 to account for the deletion of the prefix, which would also make it match with "Mr. Bob Smithwick" (4 character insertions) or "Mr. John Smith" (3 character replacements and 1 insertion) neither of which are what you want
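The "write a bit of code to figure out basic stats" step above can be sketched like this: for each candidate threshold, count how many rows in one db get a match and how many get an ambiguous (more than one) match. This uses the stdlib's difflib ratio (a 0..1 similarity) rather than raw edit distance, and all the names are invented.

```python
import difflib

def similarity(a: str, b: str) -> float:
    """0..1 similarity score; higher means more alike."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def threshold_stats(db_a, db_b, thresholds):
    """For each threshold, return (rows matched, rows matched ambiguously)."""
    stats = {}
    for t in thresholds:
        matched = ambiguous = 0
        for a in db_a:
            hits = [b for b in db_b if similarity(a, b) >= t]
            matched += bool(hits)
            ambiguous += len(hits) > 1
        stats[t] = (matched, ambiguous)
    return stats
```

Watching the ambiguous count climb as you loosen the threshold is exactly the "skyrockets if too lenient" signal described above.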

Sapozhnik
Jan 2, 2005

Nap Ghost

Progressive JPEG posted:

lol if every commit isn't just swearing with increasing intensity

Bloody
Mar 3, 2013

YeOldeButchere posted:

since it fits in memory then you can do more or less whatever you want with it, so that's good

if you do go with fuzzy matching stuff, then yeah, there's no way around the fact that you'll have to define some arbitrary threshold as to what constitutes a match and what doesn't. most of the time it gets chosen through a very technical empirical process called "loving around with it until it looks good enough". i mean, doing that usually means writing a bit of code to figure out basic stats like how many matches a given threshold gives you or how many rows in one db match to more than one row in the other (which will skyrocket if your threshold is too lenient), but that sort of stuff is exactly why cleaning up data always sucks

you should try to do as much processing to make things homogeneous before you try the fuzzy matching, though. the stuff like removing common prefixes or suffixes should be easy enough to do if you have a lot of that, and it will help with the fuzzy matching afterwards. for example if you're using edit distance (the levenshtein distance bloody mentioned earlier), then "Mr. Bob Smith" with "Mr. " removed would match right away with "Bob Smith" instead of requiring a threshold of 4 to account for the deletion of the prefix, which would also make it match with "Mr. Bob Smithwick" (4 character insertions) or "Mr. John Smith" (3 character replacements and 1 insertion) neither of which are what you want

set a threshold by scoring a ton of poo poo against random strings, calculate the standard deviation, and multiply by three
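Bloody's heuristic, sketched: score real names against random strings to estimate the similarity metric's "noise floor", then put the threshold three standard deviations above its mean. Everything here (the scorer, lengths, counts) is illustrative, not a recipe.

```python
import difflib
import random
import statistics
import string

def similarity(a: str, b: str) -> float:
    """0..1 similarity score via stdlib difflib."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def noise_threshold(names, n_random=1000, length=10, seed=0):
    """Mean noise similarity plus three standard deviations."""
    rng = random.Random(seed)
    scores = [
        similarity(rng.choice(names),
                   "".join(rng.choices(string.ascii_lowercase, k=length)))
        for _ in range(n_random)
    ]
    return statistics.mean(scores) + 3 * statistics.stdev(scores)
```

The idea is that any pair scoring above this is very unlikely to look that similar by chance, though as HoboMan points out next, how you generate the random strings matters.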

HoboMan
Nov 4, 2010

remember to make sure your random strings have uniform distribution!

Wheany
Mar 17, 2006

Spinyahahahahahahahahahahahaha!

Doctor Rope

i used some semi-inappropriate word as a temporary variable name while i was testing something and then immediately regretted it when i forgot to remove it and committed it. i caught it in review and removed it.

that was the first time i forgot to remove a temporary testing variable like that. it was also the first time i used something that was not just "qqqqq" as the name.

immediately it bit me.

HoboMan
Nov 4, 2010

Wheany posted:

i used some semi-inappropriate word as a temporary variable name while i was testing something and then immediately regretted it when i forgot to remove it and committed it. i caught it in review and removed it.

that was the first time i forgot to remove a temporary testing variable like that. it was also the first time i used something that was not just "qqqqq" as the name.

immediately it bit me.

lol, i came here to post that i just sent a poo and fart filled test framework for code review, woops

Bloody
Mar 3, 2013

HoboMan posted:

remember to make sure your random strings have uniform distribution!

hmm actually shouldn't their distribution match like typical letter distribution?

Luigi Thirty
Apr 30, 2006

Emergency confection port.

that's what they want you to think

HappyHippo
Nov 19, 2003
Do you have an Air Miles Card?

YeOldeButchere posted:

since it fits in memory then you can do more or less whatever you want with it, so that's good

if you do go with fuzzy matching stuff, then yeah, there's no way around the fact that you'll have to define some arbitrary threshold as to what constitutes a match and what doesn't. most of the time it gets chosen through a very technical empirical process called "loving around with it until it looks good enough". i mean, doing that usually means writing a bit of code to figure out basic stats like how many matches a given threshold gives you or how many rows in one db match to more than one row in the other (which will skyrocket if your threshold is too lenient), but that sort of stuff is exactly why cleaning up data always sucks

you should try to do as much processing to make things homogeneous before you try the fuzzy matching, though. the stuff like removing common prefixes or suffixes should be easy enough to do if you have a lot of that, and it will help with the fuzzy matching afterwards. for example if you're using edit distance (the levenshtein distance bloody mentioned earlier), then "Mr. Bob Smith" with "Mr. " removed would match right away with "Bob Smith" instead of requiring a threshold of 4 to account for the deletion of the prefix, which would also make it match with "Mr. Bob Smithwick" (4 character insertions) or "Mr. John Smith" (3 character replacements and 1 insertion) neither of which are what you want

also if the number of close matches is small enough you can possibly resolve them manually. assuming this is an operation you only want to do once.

HoboMan
Nov 4, 2010

Bloody posted:

hmm actually shouldn't their distribution match like typical letter distribution?

probably

ok, be sure to find the character probability of your set and then make a distribution skewed to match that probability (including average string length)!


at least i think for the matching problem you want the probability of your set and not the general occurrence probability
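HoboMan's refinement, sketched: draw random strings whose letter frequencies (and average length) mirror the names actually in the set, instead of uniform noise, so the noise-floor estimate isn't unrealistically low. The function name and defaults are invented.

```python
import collections
import random

def skewed_random_strings(names, count, seed=0):
    """Random strings matching the set's letter frequencies and mean length."""
    rng = random.Random(seed)
    freq = collections.Counter(c for name in names for c in name)
    chars = list(freq)
    weights = [freq[c] for c in chars]
    avg_len = round(sum(len(n) for n in names) / len(names))
    return ["".join(rng.choices(chars, weights=weights, k=avg_len))
            for _ in range(count)]
```

Feeding these into the calibration instead of uniform lowercase noise should push the estimated threshold up a bit, since frequency-matched junk looks more like real names.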


Sapozhnik
Jan 2, 2005

Nap Ghost
I'm gonna open source some hobby code I wrote a while back and the poo poo I was writing even five years ago is goddamn embarrassing. And it's all there in the Git history for people to point and laugh at.

At least I'm in the right thread!
