Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
:siren: :siren: :siren: PHP is deprecated :siren: :siren: :siren:

Flobbster
Feb 17, 2005

"Cadet Kirk, after the way you cheated on the Kobayashi Maru test I oughta punch you in tha face!"
I hate to interrupt the thread's bi-weekly PHP discussion and all...

So for those of you unfortunate enough to know what Apple WebObjects is, I maintain an application that runs on that. (In short, it's a super-old Java-based web app framework). There's an add-on for it called "Project Wonder", which basically adds a ton of poo poo on top of WO that ranges anywhere from little use to super useful.

I was poking around in the source for their localizer class, seeing what rules it used to automatically generate the plural forms of strings if none was provided:

code:
protected Map defaultPlurifyRules() {
	Map defaultPlurifyRules = new LinkedHashMap();

	defaultPlurifyRules.put(Pattern.compile("^equipment$", Pattern.CASE_INSENSITIVE), "equipment");
	defaultPlurifyRules.put(Pattern.compile("^information$", Pattern.CASE_INSENSITIVE), "information");
	defaultPlurifyRules.put(Pattern.compile("^rice$", Pattern.CASE_INSENSITIVE), "rice");
	defaultPlurifyRules.put(Pattern.compile("^money$", Pattern.CASE_INSENSITIVE), "money");
	defaultPlurifyRules.put(Pattern.compile("^species$", Pattern.CASE_INSENSITIVE), "species");
	defaultPlurifyRules.put(Pattern.compile("^series$", Pattern.CASE_INSENSITIVE), "series");
	defaultPlurifyRules.put(Pattern.compile("^fish$", Pattern.CASE_INSENSITIVE), "fish");
	defaultPlurifyRules.put(Pattern.compile("^sheep$", Pattern.CASE_INSENSITIVE), "sheep");

	defaultPlurifyRules.put(Pattern.compile("(.*)person$", Pattern.CASE_INSENSITIVE), "$1people");
	defaultPlurifyRules.put(Pattern.compile("(.*)man$", Pattern.CASE_INSENSITIVE), "$1men");
	defaultPlurifyRules.put(Pattern.compile("(.*)child$", Pattern.CASE_INSENSITIVE), "$1children");
	defaultPlurifyRules.put(Pattern.compile("(.*)sex$", Pattern.CASE_INSENSITIVE), "$1sexes");
	defaultPlurifyRules.put(Pattern.compile("(.*)move$", Pattern.CASE_INSENSITIVE), "$1moves");
	
	defaultPlurifyRules.put(Pattern.compile("(.*)(quiz)$", Pattern.CASE_INSENSITIVE), "$1$2zes");
	defaultPlurifyRules.put(Pattern.compile("(.*)^(ox)$", Pattern.CASE_INSENSITIVE), "$1$2en");
	defaultPlurifyRules.put(Pattern.compile("(.*)([m|l])ouse$", Pattern.CASE_INSENSITIVE), "$1$2ice");
	defaultPlurifyRules.put(Pattern.compile("(.*)(matr|vert|ind)ix|ex$", Pattern.CASE_INSENSITIVE), "$1$2ices");
	defaultPlurifyRules.put(Pattern.compile("(.*)(x|ch|ss|sh)$", Pattern.CASE_INSENSITIVE), "$1$2es");
	defaultPlurifyRules.put(Pattern.compile("(.*)([^aeiouy]|qu)y$", Pattern.CASE_INSENSITIVE), "$1$2ies");
	defaultPlurifyRules.put(Pattern.compile("(.*)(hive)$", Pattern.CASE_INSENSITIVE), "$1$2s");
	defaultPlurifyRules.put(Pattern.compile("(.*)(?:([^f])fe|([lr])f)$", Pattern.CASE_INSENSITIVE), "$1$2$3ves");
	defaultPlurifyRules.put(Pattern.compile("(.*)sis$", Pattern.CASE_INSENSITIVE), "$1ses");
	defaultPlurifyRules.put(Pattern.compile("(.*)([ti])um$", Pattern.CASE_INSENSITIVE), "$1$2a");
	defaultPlurifyRules.put(Pattern.compile("(.*)(buffal|tomat)o$", Pattern.CASE_INSENSITIVE), "$1$2oes");
	defaultPlurifyRules.put(Pattern.compile("(.*)(bu)s$", Pattern.CASE_INSENSITIVE), "$1$2ses");
	defaultPlurifyRules.put(Pattern.compile("(.*)(alias|status)$", Pattern.CASE_INSENSITIVE), "$1$2es");
	defaultPlurifyRules.put(Pattern.compile("(.*)(octop|vir)us$", Pattern.CASE_INSENSITIVE), "$1$2i");
	defaultPlurifyRules.put(Pattern.compile("(.*)(ax|test)is$", Pattern.CASE_INSENSITIVE), "$1$2es");
	defaultPlurifyRules.put(Pattern.compile("(.*)s$", Pattern.CASE_INSENSITIVE), "$1s");
	defaultPlurifyRules.put(Pattern.compile("(.*)$", Pattern.CASE_INSENSITIVE), "$1s");

	return defaultPlurifyRules;
}
I mean that's cool and all, but I chuckled at which of the words they deemed important enough to have special cases, like "buffalo" and "tomato", but "potato" gets left out.

Oh, and a horror hidden in a horror: more than one "octopus" is not "octopi"!
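
For anyone who wants to poke at it, here is a minimal sketch of how an ordered rule table like this might be applied. The `PlurifySketch` class, its `plurify` method, and the first-match-wins loop are assumptions made for illustration, not Project Wonder's actual code, and only four of the rules above are copied in.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical harness: NOT Project Wonder's implementation.
class PlurifySketch {
    // LinkedHashMap iterates in insertion order, so earlier rules take priority.
    static final Map<Pattern, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put(Pattern.compile("(.*)(buffal|tomat)o$", Pattern.CASE_INSENSITIVE), "$1$2oes");
        RULES.put(Pattern.compile("(.*)(octop|vir)us$", Pattern.CASE_INSENSITIVE), "$1$2i");
        RULES.put(Pattern.compile("(.*)s$", Pattern.CASE_INSENSITIVE), "$1s");
        RULES.put(Pattern.compile("(.*)$", Pattern.CASE_INSENSITIVE), "$1s");
    }

    // Assumption: the first rule whose pattern matches the whole word wins.
    static String plurify(String word) {
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            Matcher m = rule.getKey().matcher(word);
            if (m.matches()) {
                return m.replaceFirst(rule.getValue());
            }
        }
        return word; // not reached here, since the last rule matches anything
    }

    public static void main(String[] args) {
        for (String w : new String[] {"tomato", "potato", "octopus", "virus"}) {
            System.out.println(w + " -> " + plurify(w));
        }
    }
}
```

Under those assumptions, "potato" falls through to the catch-all rule and comes out as "potatos", while "octopus" and "virus" both hit the -us rule.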

Bonfire Lit
Jul 9, 2008

If you're one of the sinners who caused this please unfriend me now.

That entire rule is bogus. The plural of "virus" isn't "viri".

zeekner
Jul 14, 2007

Why does a localizer class do any form of pluralization? :psyduck:

Opinion Haver
Apr 9, 2007

And why does it call it 'plurify'ing?

trex eaterofcadrs
Jun 17, 2005
My lack of understanding is only exceeded by my lack of concern.
boxen? goddamn nerds.

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



TRex EaterofCars posted:

boxen? goddamn nerds.

I hate that word, too, but that expression doesn't match box.

zeekner
Jul 14, 2007

code:
defaultPlurifyRules.put(Pattern.compile("(.*)man$", Pattern.CASE_INSENSITIVE), "$1men");
defaultPlurifyRules.put(Pattern.compile("(.*)child$", Pattern.CASE_INSENSITIVE), "$1children");
Manchild -> Menchildren?

e: gently caress, didn't notice $

trex eaterofcadrs
Jun 17, 2005
My lack of understanding is only exceeded by my lack of concern.

Munkeymon posted:

I hate that word, too, but that expression doesn't match box.

You're right. What is that (.*) doing in front of the line-start symbol though? That's what threw me off.

zeekner
Jul 14, 2007

TRex EaterofCars posted:

You're right. What is that (.*) doing in front of the line-start symbol though? That's what threw me off.
http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
. Any character (may or may not match line terminators)
X* X, zero or more times

So, anything zero or more times, basically a pure wildcard. I assume strings are tokenized before calling these functions.

trex eaterofcadrs
Jun 17, 2005
My lack of understanding is only exceeded by my lack of concern.

Geekner posted:

http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
. Any character (may or may not match line terminators)
X* X, zero or more times

So, anything zero or more times, basically a pure wildcard. I assume strings are tokenized before calling these functions.

I know the regular expression syntax, but what possible input could be matched by that first group?

Zombywuf
Mar 29, 2008

Geekner posted:

http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
. Any character (may or may not match line terminators)
X* X, zero or more times

So, anything zero or more times, basically a pure wildcard. I assume strings are tokenized before calling these functions.

I think you missed the point. ^ here's the point, right here ^

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Flobbster posted:

I mean that's cool and all, but I chuckled at which of the words they deemed important enough to have special cases, like "buffalo" and "tomato", but "potato" gets left out.
Dan Quayle account found.

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



TRex EaterofCars posted:

You're right. What is that (.*) doing in front of the line-start symbol though? That's what threw me off.

It looks like someone went in there intending to add a special case for ox. They copied and pasted a line from the section of the function whose expressions operate on what the ends of words look like, rather than one from the section that replaces only whole words. Then they or someone else 'fixed' it by sticking the beginning-of-line anchor in an odd spot when they realized they were creating 'boxen' as well as 'oxen', and that people would rightly mock them for unironically using the word 'boxen'.

It works because (.*) can match nothing (as in, a zero-length string) without error. At least, that's the best way I can think to put it. I'm sure someone will correct me if I'm not stating it perfectly.
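
A quick way to check that reading is to run the rule against both words. This is a sketch using plain `java.util.regex`; the `OxRuleCheck` class and its helper names are made up for the example, and it assumes the rule is applied with `matches()` followed by `replaceFirst()`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical test harness for the "ox" rule quoted earlier in the thread.
class OxRuleCheck {
    static final Pattern OX = Pattern.compile("(.*)^(ox)$", Pattern.CASE_INSENSITIVE);

    // True only if the whole word matches the rule.
    static boolean matches(String word) {
        return OX.matcher(word).matches();
    }

    // Applies the rule's replacement if it matches, otherwise returns the word unchanged.
    static String apply(String word) {
        Matcher m = OX.matcher(word);
        return m.matches() ? m.replaceFirst("$1$2en") : word;
    }

    public static void main(String[] args) {
        System.out.println("ox  matches: " + matches("ox") + ", result: " + apply("ox"));
        System.out.println("box matches: " + matches("box") + ", result: " + apply("box"));
    }
}
```

Since `^` without `MULTILINE` only matches at the very start of the input, the first group has to be empty for anything to match, so "ox" becomes "oxen" and "box" is left alone.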

pokeyman
Nov 26, 2006

That elephant ate my entire platoon.

Munkeymon posted:

It works because (.*) can match nothing (as in, a zero-length string) without error. At least, that's the best way I can think to put it. I'm sure someone will correct me if I'm not stating it perfectly.

Dot is "any one character", star is "zero or more times". So the empty string is indeed matched by .*.

Zombywuf
Mar 29, 2008

You see, the . is like the joker in a deck of cards, it's a wildcard that can represent any other card or "character". The * is like a photocopier, it can take a single document and make any number of copies of that document, only this photocopier has a built-in shredder that can destroy the original document leaving you with no documents, or continuing the analogy "characters".

Cocoa Crispies
Jul 20, 2001

Vehicular Manslaughter!

Pillbug
. accepts any character, * is a Kleene star closure, your automata theory textbook should cover all this.

trex eaterofcadrs
Jun 17, 2005
My lack of understanding is only exceeded by my lack of concern.
I understand the DFA this thing makes. What I thought I was missing was some actual bona fide reason to have a capture group expression before the ^.

wellwhoopdedooo
Nov 23, 2007

Pound Trooper!
If it was a multi-line regex, it could potentially not be empty, provided ox was handled completely differently from every other token.

See: Using ^ and $ as Start of Line and End of Line Anchors

But, they're not multiline regexes, somebody just doesn't know how to regex. Also unless I missed something, they'll pluralize "table" as "tablels" (e: oh derp mistook 1 for l. whatever, it's still a dumb idea). If they missed something that basic, no doubt they've missed plenty more. Auto-pluralization is the dumbest god-drat idea since strcpy().
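
To make the multi-line point concrete, here is a small sketch. One detail worth hedging on: in `java.util.regex`, `.` does not cross line terminators unless `DOTALL` is also set, so `MULTILINE` on its own still appears to leave the first group empty. The `MultilineOx` class, the `groupOne` helper, and the sample input are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical experiment: what can group 1 capture under different flags?
class MultilineOx {
    // Returns what group 1 captured, or null if the rule is not found in the input.
    static String groupOne(String input, int flags) {
        Matcher m = Pattern.compile("(.*)^(ox)$", flags).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "foo\nox";
        // No flags: ^ only matches at offset 0, so nothing is found.
        System.out.println(groupOne(input, 0));
        // MULTILINE: ^ matches after the newline, but group 1 is still empty.
        System.out.println("[" + groupOne(input, Pattern.MULTILINE) + "]");
        // MULTILINE + DOTALL: now .* can swallow "foo\n" before the ^.
        System.out.println("[" + groupOne(input, Pattern.MULTILINE | Pattern.DOTALL) + "]");
    }
}
```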

wellwhoopdedooo fucked around with this message at 02:37 on Apr 13, 2011

1337JiveTurkey
Feb 17, 2005

pokeyman posted:

Dot is "any one character", star is "zero or more times". So the empty string is indeed matched by .*.

Beyond that any string is matched by .* because it eagerly evaluates. Even if you took out the ^ in the middle, ^(.*)(ox)$ can never match anything.

Opinion Haver
Apr 9, 2007

1337JiveTurkey posted:

Beyond that any string is matched by .* because it eagerly evaluates. Even if you took out the ^ in the middle, ^(.*)(ox)$ can never match anything.

Only if your regex implementation is dumb and doesn't backtrack for some reason:

code:
$ perl -e 'print "match" if "ox" =~ /^(.*)(ox)$/'
match

1337JiveTurkey
Feb 17, 2005

You have to explicitly tell the standard Java regex to backtrack with (.*?) because it can significantly affect performance.
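
A hedged Java counterpart to the Perl one-liner above, for comparison. As far as I know, `java.util.regex` backtracks greedy quantifiers by default, and `(.*?)` is the reluctant (lazy) form of the quantifier rather than a switch that turns backtracking on. The `GreedyVsReluctant` class and `firstGroup` helper are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison of greedy and reluctant quantifiers in java.util.regex.
class GreedyVsReluctant {
    // Returns group 1 if the regex matches the whole input, otherwise null.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Greedy .* gives characters back until (ox)$ can match.
        System.out.println("[" + firstGroup("^(.*)(ox)$", "ox") + "]");
        System.out.println("[" + firstGroup("^(.*)(ox)$", "box") + "]");
        // Reluctant .*? reaches the same overall match here.
        System.out.println("[" + firstGroup("^(.*?)(ox)$", "box") + "]");
        // The two only differ in which "ox" they settle on when there are several.
        System.out.println("[" + firstGroup("(.*)(ox).*", "oxbox") + "]");
        System.out.println("[" + firstGroup("(.*?)(ox).*", "oxbox") + "]");
    }
}
```

The greedy and reluctant forms both match "box"; the choice changes what each group captures when more than one split is possible, not whether the engine is willing to backtrack.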

more like dICK
Feb 15, 2010

This is inevitable.

1337JiveTurkey posted:

You have to explicitly tell the standard Java regex to backtrack with (.*?) because it can significantly affect performance.

I've had to do a lot of work with regexes in Python and Java, and seemingly mysterious regex failures in Java stumped me for a while.

Not having industry standard regex syntax is a horror.

Incoherence
May 22, 2004

POYO AND TEAR

BonzoESC posted:

. accepts any character, * is a Kleene star closure, your automata theory textbook should cover all this.
And then you read your automata theory textbook a little more closely and discover that regexps in the sense that most languages use them do not fit under the formal definition of "regular expressions". So... maybe that's not such a good idea.

defmacro
Sep 27, 2005
cacio e ping pong

Incoherence posted:

And then you read your automata theory textbook a little more closely and discover that regexps in the sense that most languages use them do not fit under the formal definition of "regular expressions". So... maybe that's not such a good idea.

The formal definitions offer great approximations of how they're used in practice. You learn about what, lazy/greedy matches and backreferences and you're all set? Do you honestly think understanding formal languages will somehow make it harder to learn regular expressions?

nielsm
Jun 1, 2009



I think the problem is that "regular expressions" as most languages implement them are not all that regular. It can be a problem when the time to match a string goes from O(n) to O(n^2) because your "regular expression" engine turns out to not actually use DFAs.
It'd be nice to replace those irregular expressions with regular expressions and parser generators... one can wish.

Incoherence
May 22, 2004

POYO AND TEAR

defmacro posted:

The formal definitions offer great approximations of how they're used in practice. You learn about what, lazy/greedy matches and backreferences and you're all set? Do you honestly think understanding formal languages will somehow make it harder to learn regular expressions?
I was reacting to the implication that if you'd just read an automata textbook, you'd understand PCRE. You might know what some of the symbols mean, but you'd have an inadequate understanding of the power of a "modern regex implementation" because the things you mentioned aren't "regular expressions" in the formal sense. Imagine if you played a game of chess without knowing what the knight did, or even knowing that it existed.

Lysidas
Jul 26, 2002

John Diefenbaker is a madman who thinks he's John Diefenbaker.
Pillbug

nielsm posted:

from O(n) to O(n^2)

It seems to be worse than that when you allow backreferences

Rusty Krustyman
Dec 5, 2002

WARNING: Wearing this jersey may result in serious injury!
I see a lot of this:

code:
if (someBoolean != FALSE) {
now see, if that were consistent everywhere, the whole "not equal false" thing wouldn't be too bad. but it's not. same file you'll get the much happier "someBoolean == TRUE", which is fine. what kills me, though, is the occasional "== FALSE" right near a "!= FALSE". When I'm staring at the debugger and attempting to figure out the code path, I'll occasionally mix up a "!=" and "==". It drives me nuts.

Mustach
Mar 2, 2003

In this long line, there's been some real strange genes. You've got 'em all, with some extras thrown in.

Lysidas posted:

It seems to be worse than that when you allow backreferences
Also, everybody should read this series: http://swtch.com/~rsc/regexp/regexp1.html
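
The benchmark in that series matches the regex a?ⁿaⁿ (n copies of `a?` followed by n copies of `a`) against a string of n `a`s. Here is a rough sketch for trying it on `java.util.regex`. The `Pathological` class, its helper names, and the values of n are made up for the example, and no timing numbers are claimed since they depend on the machine and JVM.

```java
import java.util.regex.Pattern;

// Hypothetical harness for the a?^n a^n test case.
class Pathological {
    // Small helper so the sketch does not depend on String.repeat (Java 11+).
    static String repeat(String s, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    // Builds a?{n}a{n} spelled out longhand and matches it against n 'a's.
    static boolean run(int n) {
        Pattern p = Pattern.compile(repeat("a?", n) + repeat("a", n));
        return p.matcher(repeat("a", n)).matches();
    }

    public static void main(String[] args) {
        for (int n = 10; n <= 20; n += 5) {
            long start = System.nanoTime();
            boolean matched = run(n);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.println("n=" + n + " matched=" + matched + " time=" + micros + "us");
        }
    }
}
```

The match always succeeds, because every `a?` can match nothing. On a backtracking engine the time is expected to climb steeply as n goes up, so raise n with care.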

Opinion Haver
Apr 9, 2007

Mustach posted:

Also, everybody should read this series: http://swtch.com/~rsc/regexp/regexp1.html

Huh, this looks interesting. Any idea why the mentioned languages don't use his approach?

Zombywuf
Mar 29, 2008

yaoi prophet posted:

Huh, this looks interesting. Any idea why the mentioned languages don't use his approach?

Look at the graph for the part before the Perl one goes crazy. It's orders of magnitude slower for the non pathological case. It also doesn't do back references. I think the only way this could be faster in the general case is that you could potentially do some bit-twiddling to match a whole SIMD register's width of regexes at the same time, maybe.

Malloc Voidstar
May 7, 2007

Fuck the cowboys. Unf. Fuck em hard.

Zombywuf posted:

Look at the graph for the part before the Perl one goes crazy. It's orders of magnitude slower for the non pathological case.
Important note: The Perl graph's y-axis is seconds, the Thompson NFA's y-axis is microseconds.


nielsm
Jun 1, 2009



Zombywuf posted:

Look at the graph for the part before the Perl one goes crazy.

Remember that the two graphs on the top have different vertical scales. The Perl one measures seconds, the Thompson NFA one measures microseconds.

tef
May 30, 2004

-> some l-system crap ->

yaoi prophet posted:

Huh, this looks interesting. Any idea why the mentioned languages don't use his approach?

it's henry bakers fault

tef
May 30, 2004

-> some l-system crap ->

DIW posted:

Not having industry standard regex syntax is a horror.

:colbert: POSIX :colbert:

tef
May 30, 2004

-> some l-system crap ->
edit: read thread

Cocoa Crispies
Jul 20, 2001

Vehicular Manslaughter!

Pillbug

nielsm posted:

It'd be nice to replace those irregular expressions with regular expressions and parser generators... one can wish.

This is generally what I do. I find https://github.com/bkerley/crapshoot/blob/master/lib/crapshoot/parser/scan.rl#L19-%23L30 to be more readable than a regexp that could parse the same thing.

Scaevolus
Apr 16, 2007

Zombywuf posted:

Look at the graph for the part before the Perl one goes crazy. It's orders of magnitude slower for the non pathological case. It also doesn't do back references. I think the only way this could be faster in the general case is that you could potentially do some bit-twiddling to match a whole SIMD register's width of regexes at the same time, maybe.
People have already corrected you on your speed assumptions, but note that Google developed the re2 library because they needed to be able to run regexes against database columns with predictable performance.

tef
May 30, 2004

-> some l-system crap ->

Scaevolus posted:

People have already corrected you on your speed assumptions, but note that Google developed the re2 library because they needed to be able to run regexes against database columns with predictable performance.

in specific: bounded worst case performance
