|
qntm posted:I'm going to go with .toLowerCase().length as my personal highlight here. I'm trying to decide if there's a unicode case where lowering can change length. Maybe ß?
|
# ? Oct 15, 2015 19:54 |
|
Arabic scripts have some weird rules so maybe there. But that's mostly character joining, dunno about casing.
|
# ? Oct 15, 2015 19:56 |
|
Mogomra posted:Not the most horrifying thing I've ever seen, but certainly What the gently caress lol. The sheer number of toLowerCase() calls is just murderous. I also had to stop myself from writing that with a ternary operator.
|
# ? Oct 15, 2015 20:36 |
|
Subjunctive posted:I'm trying to decide if there's a unicode case where lowering can change length. Maybe ß? Hmm, it looks like setting ß to upper case yields "SS" but the reverse isn't true. I think I'm correct, although mainly by coincidence.
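For anyone who wants to check this at a console, here is a minimal sketch. It assumes a JavaScript engine that follows the standard Unicode special-casing rules, and the variable names are purely illustrative.

```javascript
// Uppercasing ß expands to two letters, so the string gets longer.
const sharpS = "ß";
const upper = sharpS.toUpperCase();     // "SS"

// Lowercasing the result does not bring ß back.
const roundTrip = upper.toLowerCase();  // "ss"

console.log(sharpS.length, upper.length); // 1 2
console.log(roundTrip === sharpS);        // false
```

So uppercasing can change the length here, while lowercasing "SS" keeps it at two.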
|
# ? Oct 15, 2015 20:39 |
|
Subjunctive posted:I'm trying to decide if there's a unicode case where lowering can change length. Maybe ß? e: qntm posted:Hmm, it looks like setting ß to upper case yields "SS" but the reverse isn't true. I think I'm correct, although mainly by coincidence. Bonfire Lit fucked around with this message at 20:43 on Oct 15, 2015 |
# ? Oct 15, 2015 20:40 |
|
qntm posted:Hmm, it looks like setting ß to upper case yields "SS" but the reverse isn't true. I think I'm correct, although mainly by coincidence. That makes sense about the reverse, since the linguistic rules in German about "ss" vs. "ß" are far more complex than a 1:1 mapping.
|
# ? Oct 15, 2015 20:53 |
|
Just to clarify about the ß stuff: All of the *_websrc properties are AWS S3 urls, and are pretty much guaranteed to not have magical characters that multiply when they're upper/lowercased.
|
# ? Oct 15, 2015 21:00 |
|
omeg posted:Arabic scripts have some weird rules so maybe there. But that's mostly character joining, dunno about casing. nah, arabic doesn't have casing, it's mostly only found in european alphabets
|
# ? Oct 15, 2015 21:04 |
|
Case folding does ß -> ss so that ß.toUpperCase().toFoldCase() == ß.toFoldCase().
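JavaScript has no built-in toFoldCase(), so the identity above can only be approximated there. The sketch below uses a hypothetical helper, approxFoldCase, that uppercases and then lowercases. That happens to agree with real case folding for ß, but it is an assumption for illustration and not full Unicode case folding.

```javascript
// Hypothetical stand-in for case folding: uppercase, then lowercase.
// This is NOT the Unicode CaseFolding algorithm, only a rough approximation.
function approxFoldCase(s) {
  return s.toUpperCase().toLowerCase();
}

const folded = approxFoldCase("ß");                        // "ss"
const foldedViaUpper = approxFoldCase("ß".toUpperCase());  // "ss"

console.log(folded);                    // ss
console.log(folded === foldedViaUpper); // true
console.log(approxFoldCase("Straße") === approxFoldCase("STRASSE")); // true
```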
|
# ? Oct 15, 2015 21:27 |
|
Mogomra posted:Not the most horrifying thing I've ever seen, but certainly .jpg and .gif have a dot, but ttf and otf don't.
|
# ? Oct 15, 2015 21:48 |
|
Correct me if I'm wrong, but I believe that there is exactly one Unicode character that becomes more than one character on lowercasing, U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE, also known as "the Turkish I".
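A quick way to look at what U+0130 does in JavaScript, as a sketch: it assumes the engine applies the locale-independent mapping that toLowerCase() is specified to use, and the variable names are made up for the example.

```javascript
const dottedCapitalI = "\u0130"; // LATIN CAPITAL LETTER I WITH DOT ABOVE
const lowered = dottedCapitalI.toLowerCase();

// List the code points of the lowercased string in hex.
const codePoints = Array.from(lowered, ch => ch.codePointAt(0).toString(16));

console.log(dottedCapitalI.length, lowered.length); // 1 2
console.log(codePoints.join(" "));                  // 69 307
```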
|
# ? Oct 15, 2015 22:25 |
|
toLowerCase() is floor() for strings.
|
# ? Oct 15, 2015 22:46 |
|
Unicode is strings with traps in
|
# ? Oct 15, 2015 22:51 |
|
I wish they hadn't invented Unicode. We should have kept "plain text".
|
# ? Oct 15, 2015 22:57 |
|
How naïve.
|
# ? Oct 15, 2015 22:59 |
|
|
# ? Oct 15, 2015 23:01 |
|
The headaches involved in working with legacy encodings are a strict superset of the headaches of dealing with unicode.
|
# ? Oct 15, 2015 23:10 |
|
the fundamental mistake of unicode is trying to make computers handle every single traditional writing system at the same time. it's gonna hurt. particularly when you come onto hosed-up poo poo like arabic that needs to be written in several different directions at the same time. seriously read the unicode bidi algorithm some time if you want your head to explode. the correct answer would, of course, have been for everyone to standardize on ASCII for text. if you want another language, romanize it. if you want to represent another script because you think it looks nice or because you want to preserve your cultural heritage, use a bitmap. problem solved. now, as for date/time APIs, clearly the optimal solution is to adopt swatch internet time, because
|
# ? Oct 15, 2015 23:35 |
|
Mogomra posted:Not the most horrifying thing I've ever seen, but certainly I love ReSharper because with the click of a button I can make poo poo like this readable.
|
# ? Oct 15, 2015 23:45 |
|
Actually, the correct answer would have been to define a new, language-agnostic regular alphabet and make people use that
|
# ? Oct 15, 2015 23:47 |
|
Asymmetrikon posted:Actually, the correct answer would have been to define a new, language-agnostic regular alphabet and make people use that it's called ascii
|
# ? Oct 15, 2015 23:50 |
|
Asymmetrikon posted:Actually, the correct answer would have been to define a new, language-agnostic regular alphabet and make people use that Isn't that APL?
|
# ? Oct 16, 2015 00:04 |
|
(wingdings)
|
# ? Oct 16, 2015 00:16 |
|
Andy Dufresne posted:I love ReSharper because with the click of a button I can make poo poo like this readable. How does resharper know that .toLowerCase or such don't have side-effects? Or does it require you to keep track of that?
|
# ? Oct 16, 2015 00:22 |
|
Hammerite posted:Correct me if I'm wrong, but I believe that there is exactly one Unicode character that becomes more than one character on lowercasing, U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE, also known as "the Turkish I". Looks like you're correct. LATIN CAPITAL LETTER I WITH DOT ABOVE becomes i̇ on lowercasing, which is U+0069 LATIN SMALL LETTER I followed by the very last thing you would expect, U+0307 COMBINING DOT ABOVE. Asymmetrikon posted:Actually, the correct answer would have been to define a new, language-agnostic regular alphabet and make people use that Marain!
|
# ? Oct 16, 2015 00:27 |
|
Subjunctive posted:How does resharper know that .toLowerCase or such don't have side-effects? Or does it require you to keep track of that? I think it would leave those in, but change the formatting dramatically.
|
# ? Oct 16, 2015 00:33 |
|
qntm posted:Looks like you're correct. LATIN CAPITAL LETTER I WITH DOT ABOVE becomes i̇ on lowercasing, which is U+0069 LATIN SMALL LETTER I followed by the very last thing you would expect, U+0307 COMBINING DOT ABOVE. that's still just one character
|
# ? Oct 16, 2015 00:38 |
|
Plorkyeran posted:that's still just one character That may or may not mean that it only contributes 1 to the string's length, because JS is pretty awesome that way. See also surrogates.
|
# ? Oct 16, 2015 00:40 |
|
Asymmetrikon posted:Actually, the correct answer would have been to define a new, language-agnostic regular alphabet and make people use that IPA.
|
# ? Oct 16, 2015 00:55 |
|
Subjunctive posted:That may or may not mean that it only contributes 1 to the string's length, because JS is pretty awesome that way. See also surrogates. I'm not aware of any language where .length actually gives the number of characters. The languages that actually try generally end up going the route Swift did with no generic length/count/etc. property and making you explicitly say what sort of thing you want the count of.
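To make the "say what you want the count of" idea concrete in JavaScript, here is a rough sketch. The three function names are hypothetical, and countGraphemes assumes the runtime ships Intl.Segmenter (recent Node and browsers do, older ones may not).

```javascript
// Three different answers to "how long is this string?".
const countCodeUnits = s => s.length;               // UTF-16 code units
const countCodePoints = s => Array.from(s).length;  // Unicode code points

function countGraphemes(s) {
  // Grapheme clusters, roughly "what a reader sees as one character".
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(s)).length;
}

// "i" + COMBINING DOT ABOVE, followed by an emoji outside the BMP.
const sample = "i\u0307\u{1F600}";

console.log(countCodeUnits(sample));  // 4
console.log(countCodePoints(sample)); // 3
console.log(countGraphemes(sample));  // 2
```

Same string, three counts, which is why a single generic length property ends up answering a question nobody asked.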
|
# ? Oct 16, 2015 01:15 |
|
first, define "character"
|
# ? Oct 16, 2015 01:18 |
|
Soricidus posted:first, define "character" doing the right thing, even when nobody's looking
|
# ? Oct 16, 2015 01:24 |
|
That definitely doesn't sound like javascript
|
# ? Oct 16, 2015 01:25 |
|
Soricidus posted:That definitely doesn't sound like javascript Well javascript counts UTF-16 code units, which is an answer guaranteed to be wrong for pretty much all uses.
|
# ? Oct 16, 2015 01:30 |
|
Soricidus posted:first, define "character" yes, that's what leads to the swift/perl6 solution
|
# ? Oct 16, 2015 01:36 |
|
Soricidus posted:the correct answer would, of course, have been for everyone to standardize on ASCII for text. if you want another language, romanize it. if you want to represent another script because you think it looks nice or because you want to preserve your cultural heritage, use a bitmap. problem solved. When people say things like this, I can never tell if they are being serious.
|
# ? Oct 16, 2015 01:43 |
|
Soricidus posted:first, define "character" A series of ones and zeroes, but not too many zeroes, just a moderate amount.
|
# ? Oct 16, 2015 02:46 |
|
Dylan16807 posted:Well javascript counts UTF-16 code units, which is an answer guaranteed to be wrong for pretty much all uses. It's all about the UTF-EBCDIC.
|
# ? Oct 16, 2015 03:20 |
|
I forget if it's V8 or SpiderMonkey that has an "encode to UTF8" function that doesn't special-case surrogate pairs, so it really gives you CESU-8. Even the incredibly sane opinion of "just always use UTF8" is nearly impossible to implement.
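To show what that difference looks like in bytes, here is a sketch. encodeCesu8 is a hypothetical helper written for this example and does not reproduce either engine's function; it simply encodes each UTF-16 code unit on its own, which is what yields CESU-8. The standard TextEncoder is used for the real UTF-8.

```javascript
// Hypothetical helper: encode every UTF-16 code unit separately.
// Surrogate halves each become 3 bytes, which is CESU-8, not UTF-8.
function encodeCesu8(s) {
  const bytes = [];
  for (let i = 0; i < s.length; i++) {
    const cu = s.charCodeAt(i);
    if (cu < 0x80) {
      bytes.push(cu);
    } else if (cu < 0x800) {
      bytes.push(0xc0 | (cu >> 6), 0x80 | (cu & 0x3f));
    } else {
      bytes.push(0xe0 | (cu >> 12), 0x80 | ((cu >> 6) & 0x3f), 0x80 | (cu & 0x3f));
    }
  }
  return bytes;
}

const hex = bytes => bytes.map(b => b.toString(16).padStart(2, "0")).join(" ");

const emoji = "\u{1F600}"; // one code point, two UTF-16 code units
const utf8 = Array.from(new TextEncoder().encode(emoji));
const cesu8 = encodeCesu8(emoji);

console.log(hex(utf8));  // f0 9f 98 80
console.log(hex(cesu8)); // ed a0 bd ed b8 80
```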
|
# ? Oct 16, 2015 04:24 |
|
As always, the best solution is to stop trying to do text transformation if you can't actually know anything about the text. Doing weird concatenation, length checks, upper/lowercasing and the like is total nonsense for the (very broad) concept of general human text, and is usually not what you actually want. In the horror we are talking about, it seems pretty clear that he just wants a case-insensitive comparison, which can be done in a much saner way without actually transforming text. It's still fraught with peril, though, and in this case unnecessary: he probably shouldn't treat a URL as arbitrary human text, but just as a bunch of ASCII characters. Basically, my point is this: correct text transformation is in the general case hard for the same reason that natural language processing is hard. It's AI-hard. Stop trying to do it. It's not what you want to do anyway. See also: programmers making assumptions about names. These should also be seen as black boxes.
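As a sketch of the "just a bunch of ASCII characters" approach: the helper names and the example URL below are invented for illustration, and the original code's actual checks aren't known beyond what was posted. The comparison folds only A-Z and never builds a transformed copy of the string.

```javascript
// Map A-Z to a-z; every other code unit is left untouched.
function asciiLower(codeUnit) {
  return codeUnit >= 0x41 && codeUnit <= 0x5a ? codeUnit + 0x20 : codeUnit;
}

// Hypothetical helper: does `text` end with `suffix`, ignoring ASCII case only?
function endsWithAsciiIgnoreCase(text, suffix) {
  if (suffix.length > text.length) return false;
  const offset = text.length - suffix.length;
  for (let i = 0; i < suffix.length; i++) {
    const a = asciiLower(text.charCodeAt(offset + i));
    const b = asciiLower(suffix.charCodeAt(i));
    if (a !== b) return false;
  }
  return true;
}

const exampleUrl = "https://example.invalid/fonts/Body.TTF"; // made-up URL

console.log(endsWithAsciiIgnoreCase(exampleUrl, ".ttf")); // true
console.log(endsWithAsciiIgnoreCase(exampleUrl, ".otf")); // false
console.log(endsWithAsciiIgnoreCase("stra\u00dfe", "SSE")); // false: ß is not folded
```

Because nothing outside A-Z is touched, the ß and Turkish I surprises from earlier in the thread can't change the length or the result.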
|
# ? Oct 16, 2015 05:44 |