Coding Horrors: You can gather all your technical debt into one easy framework!


Hammerite: Mar 9, 2007; And you don't remember what I said here, either, but it was pompous and stupid.; Jade Ear Joe

PHP is deprecated :siren:

# ? Apr 12, 2011 14:14

Flobbster: Feb 17, 2005; "Cadet Kirk, after the way you cheated on the Kobayashi Maru test I oughta punch you in tha face!"

I hate to interrupt the thread's bi-weekly PHP discussion and all...

So for those of you unfortunate enough to know what Apple WebObjects is, I maintain an application that runs on that. (In short, it's a super-old Java-based web app framework). There's an add-on for it called "Project Wonder", which basically adds a ton of poo poo on top of WO that ranges anywhere from little use to super useful.

I was poking around in the source for their localizer class, seeing what rules it used to automatically generate the plural forms of strings if none was provided:

code:

protected Map defaultPlurifyRules() {
	Map defaultPlurifyRules = new LinkedHashMap();

	defaultPlurifyRules.put(Pattern.compile("^equipment$", Pattern.CASE_INSENSITIVE), "equipment");
	defaultPlurifyRules.put(Pattern.compile("^information$", Pattern.CASE_INSENSITIVE), "information");
	defaultPlurifyRules.put(Pattern.compile("^rice$", Pattern.CASE_INSENSITIVE), "rice");
	defaultPlurifyRules.put(Pattern.compile("^money$", Pattern.CASE_INSENSITIVE), "money");
	defaultPlurifyRules.put(Pattern.compile("^species$", Pattern.CASE_INSENSITIVE), "species");
	defaultPlurifyRules.put(Pattern.compile("^series$", Pattern.CASE_INSENSITIVE), "series");
	defaultPlurifyRules.put(Pattern.compile("^fish$", Pattern.CASE_INSENSITIVE), "fish");
	defaultPlurifyRules.put(Pattern.compile("^sheep$", Pattern.CASE_INSENSITIVE), "sheep");

	defaultPlurifyRules.put(Pattern.compile("(.*)person$", Pattern.CASE_INSENSITIVE), "$1people");
	defaultPlurifyRules.put(Pattern.compile("(.*)man$", Pattern.CASE_INSENSITIVE), "$1men");
	defaultPlurifyRules.put(Pattern.compile("(.*)child$", Pattern.CASE_INSENSITIVE), "$1children");
	defaultPlurifyRules.put(Pattern.compile("(.*)sex$", Pattern.CASE_INSENSITIVE), "$1sexes");
	defaultPlurifyRules.put(Pattern.compile("(.*)move$", Pattern.CASE_INSENSITIVE), "$1moves");
	
	defaultPlurifyRules.put(Pattern.compile("(.*)(quiz)$", Pattern.CASE_INSENSITIVE), "$1$2zes");
	defaultPlurifyRules.put(Pattern.compile("(.*)^(ox)$", Pattern.CASE_INSENSITIVE), "$1$2en");
	defaultPlurifyRules.put(Pattern.compile("(.*)([m|l])ouse$", Pattern.CASE_INSENSITIVE), "$1$2ice");
	defaultPlurifyRules.put(Pattern.compile("(.*)(matr|vert|ind)ix|ex$", Pattern.CASE_INSENSITIVE), "$1$2ices");
	defaultPlurifyRules.put(Pattern.compile("(.*)(x|ch|ss|sh)$", Pattern.CASE_INSENSITIVE), "$1$2es");
	defaultPlurifyRules.put(Pattern.compile("(.*)([^aeiouy]|qu)y$", Pattern.CASE_INSENSITIVE), "$1$2ies");
	defaultPlurifyRules.put(Pattern.compile("(.*)(hive)$", Pattern.CASE_INSENSITIVE), "$1$2s");
	defaultPlurifyRules.put(Pattern.compile("(.*)(?:([^f])fe|([lr])f)$", Pattern.CASE_INSENSITIVE), "$1$2$3ves");
	defaultPlurifyRules.put(Pattern.compile("(.*)sis$", Pattern.CASE_INSENSITIVE), "$1ses");
	defaultPlurifyRules.put(Pattern.compile("(.*)([ti])um$", Pattern.CASE_INSENSITIVE), "$1$2a");
	defaultPlurifyRules.put(Pattern.compile("(.*)(buffal|tomat)o$", Pattern.CASE_INSENSITIVE), "$1$2oes");
	defaultPlurifyRules.put(Pattern.compile("(.*)(bu)s$", Pattern.CASE_INSENSITIVE), "$1$2ses");
	defaultPlurifyRules.put(Pattern.compile("(.*)(alias|status)$", Pattern.CASE_INSENSITIVE), "$1$2es");
	defaultPlurifyRules.put(Pattern.compile("(.*)(octop|vir)us$", Pattern.CASE_INSENSITIVE), "$1$2i");
	defaultPlurifyRules.put(Pattern.compile("(.*)(ax|test)is$", Pattern.CASE_INSENSITIVE), "$1$2es");
	defaultPlurifyRules.put(Pattern.compile("(.*)s$", Pattern.CASE_INSENSITIVE), "$1s");
	defaultPlurifyRules.put(Pattern.compile("(.*)$", Pattern.CASE_INSENSITIVE), "$1s");

	return defaultPlurifyRules;
}

I mean that's cool and all, but I chuckled at which of the words they deemed important enough to have special cases, like "buffalo" and "tomato", but "potato" gets left out.

Oh, and a horror hidden in a horror: more than one "octopus" is not "octopi"!
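
To make the "potato" gap concrete, here is a minimal sketch. It assumes the localizer walks the map in insertion order and applies the first pattern that matches the whole word; the actual Project Wonder lookup code isn't shown above, so `plurify` and the class name are hypothetical, and only four of the quoted rules are included.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlurifySketch {
    // A few of the rules quoted above, kept in the same relative order.
    static final Map<Pattern, String> RULES = new LinkedHashMap<>();
    static {
        rule("(.*)(buffal|tomat)o$", "$1$2oes");
        rule("(.*)(octop|vir)us$", "$1$2i");
        rule("(.*)s$", "$1s");
        rule("(.*)$", "$1s");
    }

    static void rule(String regex, String replacement) {
        RULES.put(Pattern.compile(regex, Pattern.CASE_INSENSITIVE), replacement);
    }

    // Hypothetical lookup (an assumption): the first pattern that matches the whole word wins.
    static String plurify(String word) {
        for (Map.Entry<Pattern, String> e : RULES.entrySet()) {
            Matcher m = e.getKey().matcher(word);
            if (m.matches()) {
                return m.replaceFirst(e.getValue());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"tomato", "potato", "octopus", "virus"}) {
            System.out.println(w + " -> " + plurify(w));
        }
        // tomato -> tomatoes
        // potato -> potatos
        // octopus -> octopi
        // virus -> viri
    }
}
```

Under that assumption "potato" falls all the way through to the catch-all rule and comes out as "potatos".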

# ? Apr 12, 2011 14:51

Bonfire Lit: Jul 9, 2008; If you're one of the sinners who caused this please unfriend me now.

That entire rule is bogus. The plural of "virus" isn't "viri".

# ? Apr 12, 2011 15:27

zeekner: Jul 14, 2007

Why does a localizer class do any form of pluralization? :psyduck:

# ? Apr 12, 2011 15:51

Opinion Haver: Apr 9, 2007

And why does it call it 'plurify'ing?

# ? Apr 12, 2011 16:23

trex eaterofcadrs: Jun 17, 2005; My lack of understanding is only exceeded by my lack of concern.

boxen? goddamn nerds.

# ? Apr 12, 2011 16:42

Munkeymon: Aug 14, 2003; Motherfucker's got an armor-piercing crowbar! Rigoddamndicu𝜆ous.

TRex EaterofCars posted:

boxen? goddamn nerds.

I hate that word, too, but that expression doesn't match box.

# ? Apr 12, 2011 17:45

zeekner: Jul 14, 2007

code:

defaultPlurifyRules.put(Pattern.compile("(.*)man$", Pattern.CASE_INSENSITIVE), "$1men");
defaultPlurifyRules.put(Pattern.compile("(.*)child$", Pattern.CASE_INSENSITIVE), "$1children");

Manchild -> Menchildren?

e: gently caress, didn't notice $

# ? Apr 12, 2011 17:48

trex eaterofcadrs: Jun 17, 2005; My lack of understanding is only exceeded by my lack of concern.

Munkeymon posted:

I hate that word, too, but that expression doesn't match box.

You're right. What is that (.*) doing in front of the line-start symbol though? That's what threw me off.

# ? Apr 12, 2011 17:52

zeekner: Jul 14, 2007

TRex EaterofCars posted:

You're right. What is that (.*) doing in front of the line-start symbol though? That's what threw me off.

http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
. Any character (may or may not match line terminators)
X* X, zero or more times

So, anything zero or more times, basically a pure wildcard. I assume strings are tokenized before calling these functions.

# ? Apr 12, 2011 18:07

trex eaterofcadrs: Jun 17, 2005; My lack of understanding is only exceeded by my lack of concern.

Geekner posted:

http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
. Any character (may or may not match line terminators)
X* X, zero or more times

So, anything zero or more times, basically a pure wildcard. I assume strings are tokenized before calling these functions.

I know the regular expression syntax, but what possible input could be matched by that first group?

# ? Apr 12, 2011 18:32

Zombywuf: Mar 29, 2008

Geekner posted:

http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
. Any character (may or may not match line terminators)
X* X, zero or more times

So, anything zero or more times, basically a pure wildcard. I assume strings are tokenized before calling these functions.

I think you missed the point. ^ here's the point, right here ^

# ? Apr 12, 2011 18:33

Jethro: Jun 1, 2000; I was raised on the dairy, Bitch!

Flobbster posted:

I mean that's cool and all, but I chuckled at which of the words they deemed important enough to have special cases, like "buffalo" and "tomato", but "potato" gets left out.

Dan Quayle account found.

# ? Apr 12, 2011 18:46

Munkeymon: Aug 14, 2003; Motherfucker's got an armor-piercing crowbar! Rigoddamndicu𝜆ous.

TRex EaterofCars posted:

You're right. What is that (.*) doing in front of the line-start symbol though? That's what threw me off.

It looks like someone went in there intending to add a special case for ox, but copied and pasted a line from the section of the function that matches on what the ends of words look like instead of one from the section that replaces only whole words. Then they or someone else 'fixed' it by sticking the beginning-of-line anchor in an odd spot when they realized they were creating 'boxen' as well as 'oxen', and that people would rightly mock them for unironically using the word 'boxen'.

It works because (.*) can match nothing (as in, a zero-length string) without error. At least, that's the best way I can think to put it. I'm sure someone will correct me if I'm not stating it perfectly.
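
A quick way to check this, as a sketch: the pattern and replacement are copied from the rule quoted earlier, while the class and the `apply` helper are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OxRuleSketch {
    // Same pattern and flag as the "ox" rule in the quoted source.
    static final Pattern OX = Pattern.compile("(.*)^(ox)$", Pattern.CASE_INSENSITIVE);

    // Returns the rewritten word, or null when the rule does not apply.
    static String apply(String word) {
        Matcher m = OX.matcher(word);
        return m.find() ? m.replaceFirst("$1$2en") : null;
    }

    public static void main(String[] args) {
        System.out.println(apply("ox"));   // oxen
        System.out.println(apply("Ox"));   // Oxen
        System.out.println(apply("box"));  // null
    }
}
```

Without the MULTILINE flag, ^ only matches at the very start of the input, so group 1 can only ever capture the empty string and "box" never gets a chance to match.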

# ? Apr 12, 2011 20:01

pokeyman: Nov 26, 2006; That elephant ate my entire platoon.

Munkeymon posted:

It works because (.*) can match nothing (as in, a zero-length string) without error. At least, that's the best way I can think to put it. I'm sure someone will correct me if I'm not stating it perfectly.

Dot is "any one character", star is "zero or more times". So the empty string is indeed matched by .*.

# ? Apr 12, 2011 22:55

Zombywuf: Mar 29, 2008

You see, the . is like the joker in a deck of cards, it's a wildcard that can represent any other card or "character". The * is like a photocopier, it can take a single document and make any number of copies of that document, only this photocopier has a built-in shredder that can destroy the original document leaving you with no documents, or continuing the analogy "characters".

# ? Apr 12, 2011 23:17

Cocoa Crispies: Jul 20, 2001; Vehicular Manslaughter!; Pillbug

. accepts any character, * is a Kleene star closure, your automata theory textbook should cover all this.

# ? Apr 13, 2011 01:23

trex eaterofcadrs: Jun 17, 2005; My lack of understanding is only exceeded by my lack of concern.

I understand the DFA this thing makes. What I thought I was missing was some actual bona fide reason to have a capture group expression before the ^.

# ? Apr 13, 2011 01:34

wellwhoopdedooo: Nov 23, 2007; Pound Trooper!

If it was a multi-line regex, it could potentially not be empty, provided ox was handled completely differently from every other token.

See: Using ^ and $ as Start of Line and End of Line Anchors

But, they're not multiline regexes, somebody just doesn't know how to regex. Also unless I missed something, they'll pluralize "table" as "tablels" (e: oh derp mistook 1 for l. whatever, it's still a dumb idea). ~~If they missed something that basic, no doubt they've missed plenty more.~~ Auto-pluralization is the dumbest god-drat idea since strcpy().
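
For the curious, here is a sketch of the multi-line case in java.util.regex. The input string and the `firstGroup` helper are made up for illustration. As far as I can tell, MULTILINE alone isn't enough: the dot also has to be allowed to cross the newline (DOTALL) before group 1 can be non-empty.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MultilineOxSketch {
    // Returns what group 1 captured, or null if the pattern is not found.
    static String firstGroup(int flags, String input) {
        Matcher m = Pattern.compile("(.*)^(ox)$", flags).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "foo\nox";
        // No flags: '^' only matches at the start of the input, so nothing is found.
        System.out.println(firstGroup(0, input));
        // MULTILINE alone: '.' stops at the newline, so group 1 is still empty.
        System.out.println("[" + firstGroup(Pattern.MULTILINE, input) + "]");
        // MULTILINE plus DOTALL: '.*' can swallow "foo\n" and '^' matches after the newline.
        System.out.println("[" + firstGroup(Pattern.MULTILINE | Pattern.DOTALL, input) + "]");
    }
}
```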

wellwhoopdedooo fucked around with this message at 02:37 on Apr 13, 2011

# ? Apr 13, 2011 02:35

1337JiveTurkey: Feb 17, 2005

pokeyman posted:

Dot is "any one character", star is "zero or more times". So the empty string is indeed matched by .*.

Beyond that any string is matched by .* because it eagerly evaluates. Even if you took out the ^ in the middle, ^(.*)(ox)$ can never match anything.

# ? Apr 13, 2011 03:23

Opinion Haver: Apr 9, 2007

1337JiveTurkey posted:

Beyond that any string is matched by .* because it eagerly evaluates. Even if you took out the ^ in the middle, ^(.*)(ox)$ can never match anything.

Only if your regex implementation is dumb and doesn't backtrack for some reason:

code:

$ perl -e 'print "match" if "ox" =~ /^(.*)(ox)$/'
match

# ? Apr 13, 2011 03:26

1337JiveTurkey: Feb 17, 2005

You have to explicitly tell the standard Java regex to backtrack with (.*?) because it can significantly affect performance.
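
For reference, a sketch of how this plays out in java.util.regex: quantifiers are greedy by default and the engine still backtracks, so ^(.*)(ox)$ does match. The (.*?) form is the reluctant quantifier, which changes which match is preferred rather than switching backtracking on. The class and helper names here are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsReluctantSketch {
    // Returns group 1 of the first match, or null when nothing matches.
    static String prefix(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Greedy: '.*' grabs everything, then gives characters back until "ox" fits.
        System.out.println("[" + prefix("^(.*)(ox)$", "ox") + "]");     // []
        System.out.println("[" + prefix("^(.*)(ox)$", "box") + "]");    // [b]
        System.out.println("[" + prefix("(.*)(ox)", "foxbox") + "]");   // [foxb]
        // Reluctant: '.*?' starts with nothing and grows only as needed.
        System.out.println("[" + prefix("(.*?)(ox)", "foxbox") + "]");  // [f]
    }
}
```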

# ? Apr 13, 2011 03:48

more like dICK: Feb 15, 2010; This is inevitable.

1337JiveTurkey posted:

You have to explicitly tell the standard Java regex to backtrack with (.*?) because it can significantly affect performance.

I've had to do a lot of work with regexes in Python and Java, and seemingly mysterious regex failures in Java stumped me for a while.

Not having industry standard regex syntax is a horror.

# ? Apr 13, 2011 04:00

Incoherence: May 22, 2004; POYO AND TEAR

BonzoESC posted:

. accepts any character, * is a Kleene star closure, your automata theory textbook should cover all this.

And then you read your automata theory textbook a little more closely and discover that regexps in the sense that most languages use them do not fit under the formal definition of "regular expressions". So... maybe that's not such a good idea.

# ? Apr 13, 2011 04:18

defmacro: Sep 27, 2005; cacio e ping pong

Incoherence posted:

And then you read your automata theory textbook a little more closely and discover that regexps in the sense that most languages use them do not fit under the formal definition of "regular expressions". So... maybe that's not such a good idea.

The formal definitions offer great approximations of how they're used in practice. You learn about what, lazy/greedy matches and backreferences and you're all set? Do you honestly think understanding formal languages will somehow make it harder to learn regular expressions?

# ? Apr 13, 2011 04:31

nielsm: Jun 1, 2009

I think the problem is that "regular expressions" as most languages implement them are not all that regular. It can be a problem when the time to match a string goes from O(n) to O(n^2) because your "regular expression" engine turns out to not actually use DFAs.
It'd be nice to replace those irregular expressions with regular expressions and parser generators... one can wish.

# ? Apr 13, 2011 04:45

Incoherence: May 22, 2004; POYO AND TEAR

defmacro posted:

The formal definitions offer great approximations of how they're used in practice. You learn about what, lazy/greedy matches and backreferences and you're all set? Do you honestly think understanding formal languages will somehow make it harder to learn regular expressions?

I was reacting to the implication that if you'd just read an automata textbook, you'd understand PCRE. You might know what some of the symbols mean, but you'd have an inadequate understanding of the power of a "modern regex implementation" because the things you mentioned aren't "regular expressions" in the formal sense. Imagine if you played a game of chess without knowing what the knight did, or even knowing that it existed.

# ? Apr 13, 2011 04:57

Lysidas: Jul 26, 2002; John Diefenbaker is a madman who thinks he's John Diefenbaker.; Pillbug

nielsm posted:

from O(n) to O(n^2)

It seems to be worse than that when you allow backreferences

# ? Apr 13, 2011 04:58

Rusty Krustyman: Dec 5, 2002; WARNING: Wearing this jersey may result in serious injury!

I see a lot of this:

code:

if (someBoolean != FALSE) {

now see, if that were consistent everywhere, the whole "not equal false" thing wouldn't be too bad. but it's not. same file you'll get the much happier "someBoolean == TRUE", which is fine. what kills me, though, is the occasional "== FALSE" right near a "!= FALSE". When I'm staring at the debugger and attempting to figure out the code path, I'll occasionally mix up a "!=" and "==". It drives me nuts.

# ? Apr 13, 2011 06:47

Mustach: Mar 2, 2003; In this long line, there's been some real strange genes. You've got 'em all, with some extras thrown in.

Lysidas posted:

It seems to be worse than that when you allow backreferences

Also, everybody should read this series: http://swtch.com/~rsc/regexp/regexp1.html
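
If you want to poke at this yourself, here is a rough sketch of the kind of pathological case that series describes: n optional a's followed by n required a's, matched against a string of n a's. The class and method names are made up, and timings will vary by JVM and version, so no numbers are promised here. On a backtracking engine I'd expect the time to climb steeply as n grows.

```java
import java.util.regex.Pattern;

final class PathologicalRegexSketch {
    static String repeat(String s, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    // Builds a?a?...a?aa...a (n of each) and matches it against n copies of 'a'.
    static boolean matchPathological(int n) {
        Pattern p = Pattern.compile(repeat("a?", n) + repeat("a", n));
        return p.matcher(repeat("a", n)).matches();
    }

    public static void main(String[] args) {
        for (int n = 5; n <= 25; n += 5) {
            long start = System.nanoTime();
            boolean ok = matchPathological(n);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("n=" + n + " matched=" + ok + " in " + micros + " us");
        }
    }
}
```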

# ? Apr 13, 2011 12:52

Opinion Haver: Apr 9, 2007

Mustach posted:

Also, everybody should read this series: http://swtch.com/~rsc/regexp/regexp1.html

Huh, this looks interesting. Any idea why the mentioned languages don't use his approach?

# ? Apr 13, 2011 14:30

Zombywuf: Mar 29, 2008

yaoi prophet posted:

Huh, this looks interesting. Any idea why the mentioned languages don't use his approach?

Look at the graph for the part before the Perl one goes crazy. It's orders of magnitude slower for the non pathological case. It also doesn't do back references. I think the only way this could be faster in the general case is that you could potentially do some bit-twiddling to match a whole SIMD register's width of regexes at the same time, maybe.

# ? Apr 13, 2011 14:45

Malloc Voidstar: May 7, 2007; Fuck the cowboys. Unf. Fuck em hard.

Zombywuf posted:

Look at the graph for the part before the Perl one goes crazy. It's orders of magnitude slower for the non pathological case.

Important note: The Perl graph's y-axis is seconds, the Thompson NFA's y-axis is microseconds.

Also see:

# ? Apr 13, 2011 14:47

nielsm: Jun 1, 2009

Zombywuf posted:

Look at the graph for the part before the Perl one goes crazy.

Remember that the two graphs on the top have different vertical scales. The Perl one measures seconds, the Thompson NFA one measures microseconds.

# ? Apr 13, 2011 14:49

tef: May 30, 2004; -> some l-system crap ->

yaoi prophet posted:

Huh, this looks interesting. Any idea why the mentioned languages don't use his approach?

it's henry bakers fault

# ? Apr 13, 2011 14:56

tef: May 30, 2004; -> some l-system crap ->

DIW posted:

Not having industry standard regex syntax is a horror.

POSIX

# ? Apr 13, 2011 15:00

tef: May 30, 2004; -> some l-system crap ->

edit: read thread

# ? Apr 13, 2011 15:05

Cocoa Crispies: Jul 20, 2001; Vehicular Manslaughter!; Pillbug

nielsm posted:

It'd be nice to replace those irregular expressions with regular expressions and parser generators... one can wish.

This is generally what I do. I find https://github.com/bkerley/crapshoot/blob/master/lib/crapshoot/parser/scan.rl#L19-%23L30 to be more readable than a regexp that could parse the same thing.

# ? Apr 13, 2011 15:32

Scaevolus: Apr 16, 2007

Zombywuf posted:

Look at the graph for the part before the Perl one goes crazy. It's orders of magnitude slower for the non pathological case. It also doesn't do back references. I think the only way this could be faster in the general case is that you could potentially do some bit-twiddling to match a whole SIMD register's width of regexes at the same time, maybe.

People have already corrected you on your speed assumptions, but note that Google developed the re2 library because they needed to be able to run regexes against database columns with predictable performance.

# ? Apr 13, 2011 17:06

tef: May 30, 2004; -> some l-system crap ->

Scaevolus posted:

People have already corrected you on your speed assumptions, but note that Google developed the re2 library because they needed to be able to run regexes against database columns with predictable performance.

in specific: bounded worst case performance

# ? Apr 13, 2011 17:29
