|
PHP is deprecated
|
# ? Apr 12, 2011 14:14 |
|
|
# ? May 14, 2024 07:43 |
|
I hate to interrupt the thread's bi-weekly PHP discussion and all... So for those of you unfortunate enough to know what Apple WebObjects is, I maintain an application that runs on that. (In short, it's a super-old Java-based web app framework). There's an add-on for it called "Project Wonder", which basically adds a ton of poo poo on top of WO that ranges anywhere from little use to super useful. I was poking around in the source for their localizer class, seeing what rules it used to automatically generate the plural forms of strings if none was provided: code:
Oh, and a horror hidden in a horror: more than one "octopus" is not "octopi"!
|
# ? Apr 12, 2011 14:51 |
|
That entire rule is bogus. The plural of "virus" isn't "viri".
|
# ? Apr 12, 2011 15:27 |
|
Why does a localizer class do any form of pluralization?
|
# ? Apr 12, 2011 15:51 |
|
And why does it call it 'plurify'ing?
|
# ? Apr 12, 2011 16:23 |
|
boxen? goddamn nerds.
|
# ? Apr 12, 2011 16:42 |
|
TRex EaterofCars posted:boxen? goddamn nerds. I hate that word, too, but that expression doesn't match box.
|
# ? Apr 12, 2011 17:45 |
|
code:
e: gently caress, didn't notice $
|
# ? Apr 12, 2011 17:48 |
|
Munkeymon posted:I hate that word, too, but that expression doesn't match box. You're right. What is that (.*) doing in front of the line-start symbol though? That's what threw me off.
|
# ? Apr 12, 2011 17:52 |
|
TRex EaterofCars posted:You're right. What is that (.*) doing in front of the line-start symbol though? That's what threw me off.
http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
.   Any character (may or may not match line terminators)
X*  X, zero or more times
So, anything zero or more times, basically a pure wildcard. I assume strings are tokenized before calling these functions.
|
# ? Apr 12, 2011 18:07 |
|
Geekner posted:http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html I know the regular expression syntax, but what possible input could be matched by that first group?
|
# ? Apr 12, 2011 18:32 |
|
Geekner posted:http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html I think you missed the point. ^ here's the point, right here ^
|
# ? Apr 12, 2011 18:33 |
|
Flobbster posted:I mean that's cool and all, but I chuckled at which of the words they deemed important enough to have special cases, like "buffalo" and "tomato", but "potato" gets left out.
|
# ? Apr 12, 2011 18:46 |
|
TRex EaterofCars posted:You're right. What is that (.*) doing in front of the line-start symbol though? That's what threw me off. It looks like someone went in there intending to add a special case for ox, but copied and pasted a line from the section of the function whose expressions operate on the ends of words, rather than one from the section that replaces only whole words. Then they or someone else 'fixed' it by sticking the beginning-of-line anchor in an odd spot when they realized they were creating 'boxen' as well as 'oxen' and that people would rightly mock them for unironically using the word 'boxen'. It works because (.*) can match nothing (as in, a zero-length string) without error. At least, that's the best way I can think to put it. I'm sure someone will correct me if I'm not stating it perfectly.
|
# ? Apr 12, 2011 20:01 |
|
Munkeymon posted:It works because (.*) can match nothing (as in, a zero-length string) without error. At least, that's the best way I can think to put it. I'm sure someone will correct me if I'm not stating it perfectly. Dot is "any one character", star is "zero or more times". So the empty string is indeed matched by .*.
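To make the point above concrete, here is a minimal Java sketch of how the expression under discussion behaves. The class and method names are made up for illustration, and it assumes the rule is applied to a single word with the default flags, which the thread suggests but does not confirm.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Project Wonder.
class AnchorQuirk {
    // The expression discussed in the thread: a capture group placed before the ^ anchor.
    static final Pattern OX_RULE = Pattern.compile("(.*)^(ox)$");

    // Returns what group 1 captured, or null when the whole word does not match.
    static String prefixCapturedFor(String word) {
        Matcher m = OX_RULE.matcher(word);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"ox", "box", "oxen"}) {
            // Without MULTILINE, ^ can only match at index 0, so group 1 is forced to be empty.
            System.out.println(word + " -> " + prefixCapturedFor(word));
        }
    }
}
```

Only "ox" matches, and group 1 is the empty string, so the leading (.*) contributes nothing except keeping the group numbering the same as in the suffix-based rules.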
|
# ? Apr 12, 2011 22:55 |
|
You see, the . is like the joker in a deck of cards, it's a wildcard that can represent any other card or "character". The * is like a photocopier, it can take a single document and make any number of copies of that document, only this photocopier has a built-in shredder that can destroy the original document, leaving you with no documents, or, continuing the analogy, no "characters".
|
# ? Apr 12, 2011 23:17 |
|
. accepts any character, * is a Kleene star closure, your automata theory textbook should cover all this.
|
# ? Apr 13, 2011 01:23 |
|
I understand the DFA this thing makes. What I thought I was missing was some actual bona fide reason to have a capture group expression before the ^.
|
# ? Apr 13, 2011 01:34 |
|
If it was a multi-line regex, it could potentially not be empty, provided ox was handled completely differently from every other token. See: Using ^ and $ as Start of Line and End of Line Anchors But, they're not multiline regexes, somebody just doesn't know how to regex. Also unless I missed something, they'll pluralize "table" as "tablels" (e: oh derp mistook 1 for l. whatever, it's still a dumb idea). wellwhoopdedooo fucked around with this message at 02:37 on Apr 13, 2011 |
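A rough Java sketch of the multi-line case described above. The class name, method names and the "foo\nox" input are invented for the example. Note that in Java the group before ^ can only capture a line terminator if DOTALL is set as well, since . does not match newlines by default.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class showing when the group before ^ can be non-empty.
class MultilineAnchor {
    static final String RULE = "(.*)^(ox)$";

    private static String group1(int flags, String input) {
        Matcher m = Pattern.compile(RULE, flags).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    // Default flags: ^ matches only at the very start of the input.
    static String withDefaults(String input) {
        return group1(0, input);
    }

    // MULTILINE lets ^ match after a line terminator; DOTALL lets . consume that terminator.
    static String withMultilineDotall(String input) {
        return group1(Pattern.MULTILINE | Pattern.DOTALL, input);
    }

    public static void main(String[] args) {
        String input = "foo\nox";
        System.out.println("default flags matched: " + (withDefaults(input) != null));
        System.out.println("MULTILINE|DOTALL matched: " + (withMultilineDotall(input) != null));
    }
}
```

With both flags set, group 1 captures "foo" plus the newline; with the defaults the input does not match at all.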
# ? Apr 13, 2011 02:35 |
|
pokeyman posted:Dot is "any one character", star is "zero or more times". So the empty string is indeed matched by .*. Beyond that any string is matched by .* because it eagerly evaluates. Even if you took out the ^ in the middle, ^(.*)(ox)$ can never match anything.
|
# ? Apr 13, 2011 03:23 |
|
1337JiveTurkey posted:Beyond that any string is matched by .* because it eagerly evaluates. Even if you took out the ^ in the middle, ^(.*)(ox)$ can never match anything. Only if your regex implementation is dumb and doesn't backtrack for some reason: code:
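As a sketch of the backtracking point, this is how java.util.regex treats the expression with the misplaced ^ taken out. The class and method names are placeholders chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: greedy .* gives characters back when the rest of the pattern fails.
class GreedyBacktrack {
    static final Pattern SUFFIX_RULE = Pattern.compile("^(.*)(ox)$");

    // Returns {group 1, group 2}, or null when the word does not match.
    static String[] split(String word) {
        Matcher m = SUFFIX_RULE.matcher(word);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2)};
    }

    public static void main(String[] args) {
        for (String word : new String[] {"ox", "box", "cat"}) {
            String[] parts = split(word);
            System.out.println(word + " -> " + (parts == null ? "no match" : "[" + parts[0] + "][" + parts[1] + "]"));
        }
    }
}
```

The greedy (.*) first swallows the whole word, then backs off two characters so that (ox) and $ can succeed, which is why "box" matches with "b" in group 1.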
|
# ? Apr 13, 2011 03:26 |
|
You have to explicitly tell the standard Java regex to backtrack with (.*?) because it can significantly affect performance.
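For reference, a small sketch of the three quantifier modes in java.util.regex, as far as I understand the Pattern documentation: greedy (.*) and reluctant (.*?) both backtrack and differ only in which length they try first, while possessive (.*+) is the one that refuses to give characters back. The class name, method name and the "boxox" input are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing greedy, reluctant and possessive quantifiers.
class QuantifierModes {
    // Returns group 1 of the first match found in the input, or null if nothing matches.
    static String firstPrefix(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "boxox";
        // Greedy: takes as much as possible, then backs off until (ox) fits.
        System.out.println("greedy:     " + firstPrefix("(.*)(ox)", input));
        // Reluctant: takes as little as possible, then grows until (ox) fits.
        System.out.println("reluctant:  " + firstPrefix("(.*?)(ox)", input));
        // Possessive: takes everything and never gives it back, so (ox) cannot match.
        System.out.println("possessive: " + firstPrefix("(.*+)(ox)", input));
    }
}
```

On "boxox" the greedy form captures "box", the reluctant form captures "b", and the possessive form finds no match at all.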
|
# ? Apr 13, 2011 03:48 |
|
1337JiveTurkey posted:You have to explicitly tell the standard Java regex to backtrack with (.*?) because it can significantly affect performance. I've had to do a lot of work with regexes in Python and Java, and seemingly mysterious regex failures in Java stumped me for a while. Not having industry standard regex syntax is a horror.
|
# ? Apr 13, 2011 04:00 |
|
BonzoESC posted:. accepts any character, * is a Kleene star closure, your automata theory textbook should cover all this.
|
# ? Apr 13, 2011 04:18 |
|
Incoherence posted:And then you read your automata theory textbook a little more closely and discover that regexps in the sense that most languages use them do not fit under the formal definition of "regular expressions". So... maybe that's not such a good idea. The formal definitions offer great approximations of how they're used in practice. You learn about what, lazy/greedy matches and backreferences and you're all set? Do you honestly think understanding formal languages will somehow make it harder to learn regular expressions?
|
# ? Apr 13, 2011 04:31 |
|
I think the problem is that "regular expressions" as most languages implement them are not all that regular. It can be a problem when the time to match a string goes from O(n) to O(n^2) because your "regular expression" engine turns out to not actually use DFMs. It'd be nice to replace those irregular expressions with regular expressions and parser generators... one can wish.
|
# ? Apr 13, 2011 04:45 |
|
defmacro posted:The formal definitions offer great approximations of how they're used in practice. You learn about what, lazy/greedy matches and backreferences and you're all set? Do you honestly think understanding formal languages will somehow make it harder to learn regular expressions?
|
# ? Apr 13, 2011 04:57 |
|
nielsm posted:from O(n) to O(n^2) It seems to be worse than that when you allow backreferences
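Even without backreferences, a backtracking matcher can go well past quadratic. A commonly cited pathological case is n copies of a? followed by n copies of a, matched against a string of n a's. The Java sketch below builds that case; the class and method names are made up, and the timings it prints will vary by JVM and machine, so none are quoted here.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for a pattern that forces a backtracking engine to try many combinations.
class BacktrackBlowup {
    // Builds a?a?...a? (n times) followed by aa...a (n times).
    static String buildRegex(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append("a?");
        }
        for (int i = 0; i < n; i++) {
            sb.append('a');
        }
        return sb.toString();
    }

    // Builds the input: n copies of 'a'.
    static String buildInput(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('a');
        }
        return sb.toString();
    }

    static boolean matchesAt(int n) {
        return Pattern.matches(buildRegex(n), buildInput(n));
    }

    public static void main(String[] args) {
        // Kept small on purpose; the work done by a backtracking matcher can grow very quickly with n.
        for (int n = 5; n <= 20; n += 5) {
            long start = System.nanoTime();
            boolean ok = matchesAt(n);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("n=" + n + " matched=" + ok + " time=" + micros + "us");
        }
    }
}
```

The match always succeeds, but only once every a? has given up its character, so a matcher that tries the greedy choice first has to explore a large number of combinations before it gets there.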
|
# ? Apr 13, 2011 04:58 |
|
I see a lot of this:code:
|
# ? Apr 13, 2011 06:47 |
|
Lysidas posted:It seems to be worse than that when you allow backreferences
|
# ? Apr 13, 2011 12:52 |
|
Mustach posted:Also, everybody should read this series: http://swtch.com/~rsc/regexp/regexp1.html Huh, this looks interesting. Any idea why the mentioned languages don't use his approach?
|
# ? Apr 13, 2011 14:30 |
|
yaoi prophet posted:Huh, this looks interesting. Any idea why the mentioned languages don't use his approach? Look at the graph for the part before the Perl one goes crazy. It's orders of magnitude slower for the non pathological case. It also doesn't do back references. I think the only way this could be faster in the general case is that you could potentially do some bit-twiddling to match a whole SIMD register's width of regexes at the same time, maybe.
|
# ? Apr 13, 2011 14:45 |
|
Zombywuf posted:Look at the graph for the part before the Perl one goes crazy. It's orders of magnitude slower for the non pathological case. Also see:
|
# ? Apr 13, 2011 14:47 |
|
Zombywuf posted:Look at the graph for the part before the Perl one goes crazy. Remember that the two graphs on the top have different vertical scales. The Perl one measures seconds, the DFM one measures microseconds.
|
# ? Apr 13, 2011 14:49 |
|
yaoi prophet posted:Huh, this looks interesting. Any idea why the mentioned languages don't use his approach? it's henry bakers fault
|
# ? Apr 13, 2011 14:56 |
|
DIW posted:Not having industry standard regex syntax is a horror. POSIX
|
# ? Apr 13, 2011 15:00 |
|
edit: read thread
|
# ? Apr 13, 2011 15:05 |
|
nielsm posted:It'd be nice to replace those irregular expressions with regular expressions and parser generators... one can wish. This is generally what I do. I find https://github.com/bkerley/crapshoot/blob/master/lib/crapshoot/parser/scan.rl#L19-%23L30 to be more readable than a regexp that could parse the same thing.
|
# ? Apr 13, 2011 15:32 |
|
Zombywuf posted:Look at the graph for the part before the Perl one goes crazy. It's orders of magnitude slower for the non pathological case. It also doesn't do back references. I think the only way this could be faster in the general case is that you could potentially do some bit-twiddling to match a whole SIMD register's width of regexes at the same time, maybe.
|
# ? Apr 13, 2011 17:06 |
|
Scaevolus posted:People have already corrected you on your speed assumptions, but note that Google developed the re2 library because they needed to be able to run regexes against database columns with predictable performance. in specific: bounded worst case performance
|
# ? Apr 13, 2011 17:29 |