Subjunctive
Sep 12, 2006

✨sparkle and shine✨

qntm posted:

I'm going to go with .toLowerCase().length as my personal highlight here.

I'm trying to decide if there's a unicode case where lowering can change length. Maybe ß?

omeg
Sep 3, 2012

Arabic scripts have some weird rules so maybe there. But that's mostly character joining, dunno about casing.

ChubbyThePhat
Dec 22, 2006

Who nico nico needs anyone else

Mogomra posted:

Not the most horrifying thing I've ever seen, but certainly :pwn:

code:
for (var i = 0; i < 26; i++) {
	if (typeof request.app.requestData.fields['image_' + i + '_websrc'] !== 'undefined') {
		if (request.app.requestData.fields['image_' + i + '_websrc'].length > 0) {
			fileCountToGrab++;
			var VALID_EXTS = [".jpg", ".jpeg", ".bmp", ".png", ".gif", "ttf", "otf"];
			if (VALID_EXTS.indexOf(request.app.requestData.fields['image_' + i + '_websrc'].toLowerCase().substring(request.app.requestData.fields['image_' + i + '_websrc'].toLowerCase().length - 4).toLowerCase()) == -1 && VALID_EXTS.indexOf(request.app.requestData.fields['image_' + i + '_websrc'].toLowerCase().substring(request.app.requestData.fields['image_' + i + '_websrc'].toLowerCase().length - 5).toLowerCase()) == -1) {
				return false;
			}
		}
	}
}

What the gently caress lol. The sheer number of toLowerCase() calls is just murderous. I also had to stop myself from writing that with a ternary operator.

qntm
Jun 17, 2009

Subjunctive posted:

I'm trying to decide if there's a unicode case where lowering can change length. Maybe ß?

Hmm, it looks like setting ß to upper case yields "SS" but the reverse isn't true. I think I'm correct, although mainly by coincidence.
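
For anyone who wants to poke at this in a console, a minimal sketch of the asymmetry using only built-in String methods (the variable names are just for illustration):

```javascript
// Upper-casing ß expands it to two letters; lower-casing does not undo that.
const sharpS = 'ß';
const upper = sharpS.toUpperCase();     // "SS"
const roundTrip = upper.toLowerCase();  // "ss", not "ß"

console.log(sharpS.length, upper.length);      // 1 2
console.log(roundTrip === sharpS);             // false
console.log(sharpS.toLowerCase() === sharpS);  // true: ß is already lower case
```

So upper-casing can change the length here, but lower-casing ß leaves it alone.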

Bonfire Lit
Jul 9, 2008

If you're one of the sinners who caused this please unfriend me now.

Subjunctive posted:

I'm trying to decide if there's a unicode case where lowering can change length. Maybe ß?
ß already is lower-case. (The capital form is ẞ and nobody uses it.)

e:

qntm posted:

Hmm, it looks like setting ß to upper case yields "SS" but the reverse isn't true. I think I'm correct, although mainly by coincidence.
Yeah, ß has no traditional capital form because it never occurs at the beginning of a word. If you're type-setting allcaps, you're supposed to use SS unless you have to preserve the exact spelling of a proper name.

Bonfire Lit fucked around with this message at 20:43 on Oct 15, 2015

Flobbster
Feb 17, 2005

"Cadet Kirk, after the way you cheated on the Kobayashi Maru test I oughta punch you in tha face!"

qntm posted:

Hmm, it looks like setting ß to upper case yields "SS" but the reverse isn't true. I think I'm correct, although mainly by coincidence.

That makes sense about the reverse, since the linguistic rules in German about "ss" vs. "ß" are far more complex than a 1:1 mapping.

Mogomra
Nov 5, 2005

simply having a wonderful time
Just to clarify about the ß stuff:

All of the *_websrc properties are AWS S3 urls, and are pretty much guaranteed to not have magical characters that multiply when they're upper/lowercased.

Soricidus
Oct 21, 2010
freedom-hating statist shill

omeg posted:

Arabic scripts have some weird rules so maybe there. But that's mostly character joining, dunno about casing.

nah, arabic doesn't have casing, it's mostly only found in european alphabets

Plorkyeran
Mar 22, 2007

To Escape The Shackles Of The Old Forums, We Must Reject The Tribal Negativity He Endorsed
Case folding does ß -> ss so that ß.toUpperCase().toFoldCase() == ß.toFoldCase().
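
JavaScript strings have no toFoldCase(), so here is a rough sketch of that identity using a hypothetical helper, foldApprox, which upper-cases and then lower-cases. It is only an approximation of Unicode case folding, not an implementation of it, but it happens to behave the same way for ß:

```javascript
// Hypothetical helper: approximates case folding by upper-casing, then lower-casing.
// This is NOT full Unicode case folding; it is just enough to show the ß behaviour.
function foldApprox(s) {
  return s.toUpperCase().toLowerCase();
}

console.log(foldApprox('ß'));                                    // "ss"
console.log(foldApprox('ß'.toUpperCase()) === foldApprox('ß'));  // true
console.log(foldApprox('Straße') === foldApprox('STRASSE'));     // true
```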

Carbon dioxide
Oct 9, 2012

Mogomra posted:

Not the most horrifying thing I've ever seen, but certainly :pwn:

code:
for (var i = 0; i < 26; i++) {
	if (typeof request.app.requestData.fields['image_' + i + '_websrc'] !== 'undefined') {
		if (request.app.requestData.fields['image_' + i + '_websrc'].length > 0) {
			fileCountToGrab++;
			var VALID_EXTS = [".jpg", ".jpeg", ".bmp", ".png", ".gif", "ttf", "otf"];
			if (VALID_EXTS.indexOf(request.app.requestData.fields['image_' + i + '_websrc'].toLowerCase().substring(request.app.requestData.fields['image_' + i + '_websrc'].toLowerCase().length - 4).toLowerCase()) == -1 && VALID_EXTS.indexOf(request.app.requestData.fields['image_' + i + '_websrc'].toLowerCase().substring(request.app.requestData.fields['image_' + i + '_websrc'].toLowerCase().length - 5).toLowerCase()) == -1) {
				return false;
			}
		}
	}
}

.jpg and .gif have a dot, but ttf and otf don't.
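
A small sketch of why that matters, assuming the intent was to accept fonts too. The function name, the second list, and the example file names are made up for illustration; the comparison itself mirrors the posted condition (look up the last 4 and the last 5 characters in the list):

```javascript
// The list exactly as posted: the font extensions lack the leading dot.
const POSTED_EXTS = ['.jpg', '.jpeg', '.bmp', '.png', '.gif', 'ttf', 'otf'];
// Hypothetical corrected list, assuming fonts were meant to be allowed.
const DOTTED_EXTS = ['.jpg', '.jpeg', '.bmp', '.png', '.gif', '.ttf', '.otf'];

// Same test as the original condition, with the field lookup and the
// lower-casing done once instead of on every access.
function passesCheck(url, exts) {
  const lower = url.toLowerCase();
  const last4 = lower.substring(lower.length - 4);
  const last5 = lower.substring(lower.length - 5);
  return exts.indexOf(last4) !== -1 || exts.indexOf(last5) !== -1;
}

console.log(passesCheck('photo.JPG', POSTED_EXTS));  // true
console.log(passesCheck('font.ttf', POSTED_EXTS));   // false: ".ttf" is not in the list
console.log(passesCheck('font.ttf', DOTTED_EXTS));   // true
```

Since the last 4 or 5 characters can never equal a 3-character entry, the undotted "ttf" and "otf" can never match anything.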

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
Correct me if I'm wrong, but I believe that there is exactly one Unicode character that becomes more than one character on lowercasing, U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE, also known as "the Turkish I".

Karate Bastard
Jul 31, 2007

Soiled Meat
toLowerCase() is floor() for strings.

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
Unicode is strings with traps in

Karate Bastard
Jul 31, 2007

Soiled Meat
I wish they hadn't invented Unicode. We should have kept "plain text".

Subjunctive
Sep 12, 2006

✨sparkle and shine✨

How naïve.

Karate Bastard
Jul 31, 2007

Soiled Meat
:cheeky:

Plorkyeran
Mar 22, 2007

To Escape The Shackles Of The Old Forums, We Must Reject The Tribal Negativity He Endorsed
The headaches involved in working with legacy encodings are a strict superset of the headaches of dealing with unicode.

Soricidus
Oct 21, 2010
freedom-hating statist shill
the fundamental mistake of unicode is trying to make computers handle every single traditional writing system at the same time. it's gonna hurt. particularly when you come onto hosed-up poo poo like arabic that needs to be written in several different directions at the same time. seriously read the unicode bidi algorithm some time if you want your head to explode.

the correct answer would, of course, have been for everyone to standardize on ASCII for text. if you want another language, romanize it. if you want to represent another script because you think it looks nice or because you want to preserve your cultural heritage, use a bitmap. problem solved.

now, as for date/time APIs, clearly the optimal solution is to adopt swatch internet time, because

Andy Dufresne
Aug 4, 2010

The only good race pace is suicide pace, and today looks like a good day to die

Mogomra posted:

Not the most horrifying thing I've ever seen, but certainly :pwn:

code:
for (var i = 0; i < 26; i++) {
	if (typeof request.app.requestData.fields['image_' + i + '_websrc'] !== 'undefined') {
		if (request.app.requestData.fields['image_' + i + '_websrc'].length > 0) {
			fileCountToGrab++;
			var VALID_EXTS = [".jpg", ".jpeg", ".bmp", ".png", ".gif", "ttf", "otf"];
			if (VALID_EXTS.indexOf(request.app.requestData.fields['image_' + i + '_websrc'].toLowerCase().substring(request.app.requestData.fields['image_' + i + '_websrc'].toLowerCase().length - 4).toLowerCase()) == -1 && VALID_EXTS.indexOf(request.app.requestData.fields['image_' + i + '_websrc'].toLowerCase().substring(request.app.requestData.fields['image_' + i + '_websrc'].toLowerCase().length - 5).toLowerCase()) == -1) {
				return false;
			}
		}
	}
}

I love ReSharper because with the click of a button I can make poo poo like this readable.

Asymmetrikon
Oct 30, 2009

I believe you're a big dork!
Actually, the correct answer would have been to define a new, language-agnostic regular alphabet and make people use that

b0lt
Apr 29, 2005

Asymmetrikon posted:

Actually, the correct answer would have been to define a new, language-agnostic regular alphabet and make people use that

it's called ascii

canis minor
May 4, 2011

Asymmetrikon posted:

Actually, the correct answer would have been to define a new, language-agnostic regular alphabet and make people use that

Isn't that APL?

Asymmetrikon
Oct 30, 2009

I believe you're a big dork!


(wingdings)

Subjunctive
Sep 12, 2006

✨sparkle and shine✨

Andy Dufresne posted:

I love ReSharper because with the click of a button I can make poo poo like this readable.

How does resharper know that .toLowerCase or such don't have side-effects? Or does it require you to keep track of that?

qntm
Jun 17, 2009

Hammerite posted:

Correct me if I'm wrong, but I believe that there is exactly one Unicode character that becomes more than one character on lowercasing, U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE, also known as "the Turkish I".

Looks like you're correct. LATIN CAPITAL LETTER I WITH DOT ABOVE becomes i̇ on lowercasing, which is U+0069 LATIN SMALL LETTER I followed by the very last thing you would expect, U+0307 COMBINING DOT ABOVE.
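
A quick sketch to reproduce that with the plain, locale-independent toLowerCase() (variable names are only for illustration):

```javascript
// U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
const dottedCapitalI = '\u0130';
const lowered = dottedCapitalI.toLowerCase();

// List the code points of the lower-cased result.
const codePoints = Array.from(lowered, (c) =>
  'U+' + c.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
).join(' ');

console.log(dottedCapitalI.length);  // 1
console.log(lowered.length);         // 2
console.log(codePoints);             // U+0069 U+0307
```

So this is a case where .toLowerCase().length differs from .length.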

Asymmetrikon posted:

Actually, the correct answer would have been to define a new, language-agnostic regular alphabet and make people use that

Marain!

necrotic
Aug 2, 2005
I owe my brother big time for this!

Subjunctive posted:

How does resharper know that .toLowerCase or such don't have side-effects? Or does it require you to keep track of that?

I think it would leave those in, but change the formatting dramatically.

Plorkyeran
Mar 22, 2007

To Escape The Shackles Of The Old Forums, We Must Reject The Tribal Negativity He Endorsed

qntm posted:

Looks like you're correct. LATIN CAPITAL LETTER I WITH DOT ABOVE becomes i̇ on lowercasing, which is U+0069 LATIN SMALL LETTER I followed by the very last thing you would expect, U+0307 COMBINING DOT ABOVE.

that's still just one character

Subjunctive
Sep 12, 2006

✨sparkle and shine✨

Plorkyeran posted:

that's still just one character

That may or may not mean that it only contributes 1 to the string's length, because JS is pretty awesome that way. See also surrogates.

fritz
Jul 26, 2003

Asymmetrikon posted:

Actually, the correct answer would have been to define a new, language-agnostic regular alphabet and make people use that

IPA.

Plorkyeran
Mar 22, 2007

To Escape The Shackles Of The Old Forums, We Must Reject The Tribal Negativity He Endorsed

Subjunctive posted:

That may or may not mean that it only contributes 1 to the string's length, because JS is pretty awesome that way. See also surrogates.

I'm not aware of any language where .length actually gives the number of characters. The languages that actually try generally end up going the route Swift did with no generic length/count/etc. property and making you explicitly say what sort of thing you want the count of.

Soricidus
Oct 21, 2010
freedom-hating statist shill
first, define "character"

Blotto Skorzany
Nov 7, 2008

He's a PSoC, loose and runnin'
came the whisper from each lip
And he's here to do some business with
the bad ADC on his chip
bad ADC on his chiiiiip

Soricidus posted:

first, define "character"

doing the right thing, even when nobody's looking

Soricidus
Oct 21, 2010
freedom-hating statist shill
That definitely doesn't sound like javascript

Dylan16807
May 12, 2010

Soricidus posted:

That definitely doesn't sound like javascript

Well javascript counts UTF-16 code units, which is an answer guaranteed to be wrong for pretty much all uses.
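
A rough sketch of how the different counts diverge, using only built-ins. The helper names are hypothetical, and none of these counts is "the" number of characters:

```javascript
// .length counts UTF-16 code units.
function countCodeUnits(s) {
  return s.length;
}

// String iteration walks code points, so surrogate pairs count once.
function countCodePoints(s) {
  return Array.from(s).length;
}

const emoji = '\u{1F600}';     // one code point outside the BMP (a surrogate pair)
const decomposed = 'e\u0301';  // "e" followed by COMBINING ACUTE ACCENT

console.log(countCodeUnits(emoji), countCodePoints(emoji));            // 2 1
console.log(countCodeUnits(decomposed), countCodePoints(decomposed));  // 2 2
console.log(decomposed.normalize('NFC').length);                       // 1
```

Same visible text, three different answers depending on what you decide to count.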

Plorkyeran
Mar 22, 2007

To Escape The Shackles Of The Old Forums, We Must Reject The Tribal Negativity He Endorsed

Soricidus posted:

first, define "character"

yes, that's what leads to the swift/perl6 solution

VikingofRock
Aug 24, 2008




Soricidus posted:

the correct answer would, of course, have been for everyone to standardize on ASCII for text. if you want another language, romanize it. if you want to represent another script because you think it looks nice or because you want to preserve your cultural heritage, use a bitmap. problem solved.

When people say things like this, I can never tell if they are being serious.

Pavlov
Oct 21, 2012

I've long been fascinated with how the alt-right develops elaborate and obscure dog whistles to try to communicate their meaning without having to say it out loud
Stepan Andreyevich Bandera being the most prominent example of that

Soricidus posted:

first, define "character"

A series of ones and zeroes, but not too many zeroes, just a moderate amount.

Kazinsal
Dec 13, 2011

Dylan16807 posted:

Well javascript counts UTF-16 code units, which is an answer guaranteed to be wrong for pretty much all uses.

It's all about the UTF-EBCDIC.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
I forget if it's V8 or SpiderMonkey that has an "encode to UTF8" function, but doesn't special-case surrogate pairs, so it really gives you CESU-8. Even the incredibly sane opinion of "just always use UTF8" is nearly impossible to implement.
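
To make the difference concrete, here is a sketch of what "not special-casing surrogate pairs" produces. The functions encodeUnit and cesu8Bytes are hypothetical helpers written for this post, not any engine's actual API; the only real API used for comparison is the standard TextEncoder:

```javascript
// Encode one value below 0x10000 with the 1-, 2-, or 3-byte UTF-8 bit pattern.
function encodeUnit(u) {
  if (u < 0x80) return [u];
  if (u < 0x800) return [0xC0 | (u >> 6), 0x80 | (u & 0x3F)];
  return [0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F), 0x80 | (u & 0x3F)];
}

// CESU-8 style: encode each UTF-16 code unit on its own, so the two halves
// of a surrogate pair become two 3-byte sequences instead of one 4-byte one.
function cesu8Bytes(s) {
  const out = [];
  for (let i = 0; i < s.length; i++) {
    out.push(...encodeUnit(s.charCodeAt(i)));
  }
  return out;
}

const hex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join(' ');

const face = '\u{1F600}';
console.log(hex(cesu8Bytes(face)));                // ed a0 bd ed b8 80
console.log(hex(new TextEncoder().encode(face)));  // f0 9f 98 80
```

For text that stays inside the BMP the two outputs are identical, which is presumably why this kind of bug goes unnoticed for so long.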

Athas
Aug 6, 2007

fuck that joker
As always, the best solution is to stop trying to do text transformation if you can't actually know anything about the text. Doing weird concatenation, length checks, upper/lowercasing and the like is total nonsense for the (very broad) concept of general human text, and is usually not what you actually want. In the horror we are talking about, it seems pretty clear that he just wants a non-case-sensitive comparison, which can be done in a much saner way without actually transforming text. It's still fraught with peril, though, and in this case unnecessary: he probably shouldn't treat a URL as arbitrary human text, but just as a bunch of ASCII characters.

Basically, my point is this: correct text transformation is in the general case hard for the same reason that natural language processing is hard. It's AI-hard. Stop trying to do it. It's not what you want to do anyway. See also: programmers making assumptions about names. These should also be seen as black boxes.
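
As one possible sketch of the "compare without transforming" idea for the extension check: a case-insensitive regular expression over an ASCII-only pattern. The names ALLOWED_EXT and hasAllowedExtension and the example URLs are hypothetical, the extension list is taken from the posted code on the assumption that fonts were meant to be allowed, and it assumes the URL ends with the file name (no query string):

```javascript
// ASCII-only pattern; the i flag does the case-insensitive comparison,
// so the input string is never upper- or lower-cased.
const ALLOWED_EXT = /\.(?:jpe?g|bmp|png|gif|ttf|otf)$/i;

function hasAllowedExtension(url) {
  return ALLOWED_EXT.test(url);
}

console.log(hasAllowedExtension('https://example.invalid/a/Photo.JPEG'));  // true
console.log(hasAllowedExtension('https://example.invalid/a/font.OTF'));    // true
console.log(hasAllowedExtension('https://example.invalid/a/notes.txt'));   // false
```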
