Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

cinci zoo sniper posted:

i see. i was aware of the tool but never had to use it. sounds like a really stupid thing to gently caress up if it was a first party thing, which i assume it is

re: your other post, what problems do you face with filenames in py3 that weren't there in py2?

it was a really stupid thing to gently caress up

also python decided that all filenames are unicode, full stop. on posix this means filenames (and command line arguments too) get interpreted according to the current locale. yes, POSIX is hosed up when it comes to character encodings, but decreeing unicode by law doesn't make things better, it makes them worse, since all the problems of POSIX are still there.

my favorite part of this is that for a long time, the built-in zipfile module also translated filenames based on the system locale, meaning the same zip file could have different filenames on different systems with different encoding settings. and there was no way to access the raw bytes of the filename. i think they eventually added that.

the same thing happened with command line arguments too.
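A quick sketch of the filename situation being described, using nothing but stock Python 3 library behavior: `os.listdir()` gives you the decoded unicode view for a `str` path and the raw bytes view for a `bytes` path.

```python
import os
import tempfile

# Python 3 exposes both views of a filename: a str argument to
# os.listdir() yields str names (decoded with the filesystem
# encoding), while a bytes argument yields the raw bytes names.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "hello.txt"), "w").close()

    as_text = os.listdir(d)                 # ['hello.txt'] as str
    as_bytes = os.listdir(os.fsencode(d))   # [b'hello.txt'] as bytes

    print(type(as_text[0]))    # <class 'str'>
    print(type(as_bytes[0]))   # <class 'bytes'>
```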

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
also file io is ridiculous. the "b" flag to open() in python3 changes the return type of read() from unicode string to bytestring (a completely different data structure, more like an array of integers than a string), and there is no way to test what kind of file you were handed other than to read 0 bytes and check the type of the return value. in python 2, iirc, all the flag did was control whether '\n' was translated to '\r\n' on windows when writing files.

i think if you pass a "bytestring file" to several built-in python modules, they crash, because they assume the return value of file.read() is a unicode string.

this goes into rjmccall's point about static typing: there's this murky area where now there are two different file types, one whose api uses unicode strings and one that uses bytestrings, and there's no way to tell them apart. there really should be two distinct file types.

in practice people stopped using bytestring files.
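For the record, a minimal sketch of the type flip described above (stock Python 3 behavior, no thread-specific code), including the "read 0 bytes" probe:

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w") as f:
    f.write("hello")

# Text mode: read() returns str.
with open(path, "r") as f:
    assert isinstance(f.read(), str)

# Binary mode: read() returns bytes, a different type entirely.
with open(path, "rb") as f:
    assert isinstance(f.read(), bytes)

# The "read 0 bytes" trick: probe the type without consuming data.
with open(path, "rb") as f:
    print(type(f.read(0)))   # <class 'bytes'>

os.remove(path)
```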

MrMoo
Sep 14, 2000

cinci zoo sniper posted:

i completely dont get the byte-string and text-string difference, or its implications, even though my naive assumption is that one is an array of bytes and the other an array of chars

1 char ≥ 1 byte, and most importantly the width is variable. When 1 char = 1 byte you can use tools like memcmp, memcpy, and trivial offset references for nth characters.

It is frequently why UTF-8 is used as a storage and message encoding, because it saves space, but UTF-16 or UTF-32 is used within UI applications because characters become fixed width and offset references become trivial again.

In Python a byte-string picks up your locale encoding, which is otherwise only detectable by complicated statistical analysis of the text; browsers like Firefox and Chrome do this when presented with non-Unicode text, and still frequently fail at it.

MrMoo fucked around with this message at 01:28 on Nov 26, 2017
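To illustrate MrMoo's point, a minimal sketch: the same single byte decodes to three different characters under three legacy encodings, which is exactly why charset guessing needs statistical analysis rather than inspection.

```python
# One byte, three different "characters", depending on which legacy
# encoding you assume the text is in.
raw = b"\xe0"

print(raw.decode("latin-1"))   # 'à' (Western European)
print(raw.decode("cp1251"))    # 'а' (Cyrillic small a)
print(raw.decode("cp437"))     # 'α' (Greek alpha, old DOS code page)
```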

cinci zoo sniper
Mar 15, 2013

oic, ty. this "b" input thing about bytestrings has me worried now. i should run quickly through the py3 io docs at some point, since i ported one of my current projects that deals with io a lot to py3, and while it seemingly works identically to how it did before, i havent spent too much time testing it thoroughly yet

MrMoo posted:

1 char ≥ 1 byte, and most importantly the width is variable. When 1 char = 1 byte you can use tools like memcmp, memcpy, and trivial offset references for nth characters.
oh, right :doh:

Fiedler
Jun 29, 2002

I, for one, welcome our new mouse overlords.

MrMoo posted:

but UTF-16 or UTF-32 is used within UI applications because characters become fixed width and offset references become trivial again.
UTF-16 is not fixed-width.

invlwhen
Jul 28, 2012

please do your best

MrMoo posted:

It is frequently why UTF-8 is used as a storage and message encoding, because it saves space, but UTF-16 or UTF-32 is used within UI applications because characters become fixed width and offset references become trivial again.

my dude, let me introduce you to surrogate pairs

Doom Mathematic
Sep 2, 2008
More to the point, you almost never actually want to get the Nth character of a text string. The things you want to do with an array of bytes don't overlap all that much with the things you want to do with a piece of text, even though the text is commonly stored as an array of bytes.

MrMoo
Sep 14, 2000

Fiedler posted:

UTF-16 is not fixed-width.

I meant UCS-2, but that doesn't stop people calling it UTF-16 :lol:

pseudorandom name
May 6, 2007

UCS-2 and UCS-4/UTF-32 aren't fixed width either

ComradeCosmobot
Dec 4, 2004

USPOL July

MrMoo posted:

I meant UCS-2, but that doesn't stop people calling it UTF-16 :lol:

let us tell u about something called combining characters
/
👨‍👨‍👦‍👦

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

pseudorandom name posted:

UCS-2 and UCS-4/UTF-32 aren't fixed width either

uhhhh, I thought the whole point of UTF-32 was that it was fixed-width

redleader
Aug 18, 2005

Engage according to operational parameters

pseudorandom name posted:

UCS-2 and UCS-4/UTF-32 aren't fixed width either

wait, what?

pseudorandom name
May 6, 2007

ComradeCosmobot posted:

let us tell u about something called combining characters
/
👨‍👨‍👦‍👦

please do
/
👩🏾‍💻

MrMoo
Sep 14, 2000

:negative: Fixed width for code points not characters, ugh. Way too much encoding chat already :gonk:

Doom Mathematic
Sep 2, 2008

pseudorandom name posted:

UCS-2 and UCS-4/UTF-32 aren't fixed width either

UCS-2 characters are fixed at width 16 bits, UCS-4/UTF-32 are fixed at 32 bits. Is this wrong?

crazypenguin
Mar 9, 2005
nothing witty here, move along
https://mathias.gaunard.com/unicode/doc/html/unicode/introduction_to_unicode.html

no really, if you're confused by that, click that link. I know it says introduction to unicode but it isn't long and it's really just about exactly this point

MrMoo
Sep 14, 2000

Easiest with the example from ComradeCosmobot:

👨‍👨‍👦‍👦 (Man, Man, Boy, Boy) is a single user-visible character composed of four characters joined together, seven code points in all:

Codepoint #1 = Man 👨 = U+1F468
Codepoint #2 = Zero-width joiner = U+200D
Codepoint #3 = Man 👨 = U+1F468
Codepoint #4 = Zero-width joiner = U+200D
Codepoint #5 = Boy 👦 = U+1F466
Codepoint #6 = Zero-width joiner = U+200D
Codepoint #7 = Boy 👦 = U+1F466

Where U+<hex> means the raw Unicode code point, which you can usually reference in a language literal, e.g. "\U0001F468" in Python.

The raw bytes on macOS look like this, 26 bytes long (25 for the sequence plus a trailing newline):

$ cat unicode.txt
👨‍👨‍👦‍👦
$ hexdump unicode.txt
0000000 f0 9f 91 a8 e2 80 8d f0 9f 91 a8 e2 80 8d f0 9f
0000010 91 a6 e2 80 8d f0 9f 91 a6 0a
000001a


i.e. Unicode is quite complicated, and characters can even take on multiple meanings due to special modifying codepoints.

MrMoo fucked around with this message at 03:32 on Nov 26, 2017
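The breakdown above can be checked from Python 3, where len() counts code points (a sketch using the same family emoji):

```python
# The family emoji: Man, ZWJ, Man, ZWJ, Boy, ZWJ, Boy.
family = "\U0001F468\u200D\U0001F468\u200D\U0001F466\u200D\U0001F466"

print(len(family))                       # 7  code points
print(len(family.encode("utf-8")))       # 25 bytes (matches the hexdump, minus the newline)
print(len(family.encode("utf-16-le")))   # 22 bytes: 11 16-bit units (surrogate pairs for the emoji)
print(len(family.encode("utf-32-le")))   # 28 bytes: 7 * 4, one unit per code point
```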

Doom Mathematic
Sep 2, 2008
I don't know what that is meant to be an example of.

pseudorandom name
May 6, 2007

Doom Mathematic posted:

I don't know what that is meant to be an example of.

UTF-32 isn’t a fixed width encoding.

ComradeCosmobot
Dec 4, 2004

USPOL July

MrMoo posted:

Easiest with the example from ComradeCosmobot:

👨‍👨‍👦‍👦 (Man, Man, Boy, Boy) is a single user-visible character composed of four characters joined together, seven code points in all:

Codepoint #1 = Man 👨 = U+1F468
Codepoint #2 = Zero-width joiner = U+200D
Codepoint #3 = Man 👨 = U+1F468
Codepoint #4 = Zero-width joiner = U+200D
Codepoint #5 = Boy 👦 = U+1F466
Codepoint #6 = Zero-width joiner = U+200D
Codepoint #7 = Boy 👦 = U+1F466

Where U+<hex> means the raw Unicode code point, which you can usually reference in a language literal, e.g. "\U0001F468" in Python.

The raw bytes on macOS look like this, 26 bytes long (25 for the sequence plus a trailing newline):

$ cat unicode.txt
👨‍👨‍👦‍👦
$ hexdump unicode.txt
0000000 f0 9f 91 a8 e2 80 8d f0 9f 91 a8 e2 80 8d f0 9f
0000010 91 a6 e2 80 8d f0 9f 91 a6 0a
000001a


i.e. Unicode is quite complicated, and characters can even take on multiple meanings due to special modifying codepoints.

the best part about that example is that i lied: the example i gave technically has no combining characters and, depending on which emoji zwj sequences your standard library is aware of (if any; they technically aren’t required for a compliant implementation!), it might count as 7 characters or as 1 character

(a true combining character would involve a codepoint that has a non-zero combining class)

ComradeCosmobot fucked around with this message at 04:13 on Nov 26, 2017

redleader
Aug 18, 2005

Engage according to operational parameters

pseudorandom name posted:

UTF-32 isn’t a fixed width encoding.

4 bytes is enough to represent every code point that has been and can be defined by unicode as it currently stands. what am i missing?

pokeyman
Nov 26, 2006

That elephant ate my entire platoon.

redleader posted:

4 bytes is enough to represent every code point that has been and can be defined by unicode as it currently stands. what am i missing?

they’re saying it’s not a fixed-width encoding of characters. you’re saying it’s a fixed-width encoding of code points

I’m not knowledgeable enough to comment on whether this is too pedantic a distinction to make in this context (are the various encoding standards defined by code point or by character and does it matter?), but I can comfortably say that if you’re being too pedantic in a discussion about loving Unicode then maybe it’s time to take a good look in the mirror and ask yourself who you wanna be in this world

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003

redleader posted:

4 bytes is enough to represent every code point that has been and can be defined by unicode as it currently stands. what am i missing?

characters can be made up of multiple code points combined (see the post above yours), so just because you can jump up 4 bytes to the next code point doesn't mean you've advanced one character, which is probably what your user wanted to do with the code you wrote, since that's what she saw on the screen

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe
the point is that while you can O(1) index to a code point in those encodings, there is no actual semantic operation on strings that takes advantage of that because you have to be aware of the possibility that you've indexed into the middle of a multi-code-point character. you could scan backwards to see if you are but you can do that with utf-8 and utf-16, too

there are simpler examples even in ucs-2, like the combining hangul or (more familiar to westerners) the combining diacritics
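rjmccall's point in a Python 3 sketch, using the combining acute accent: O(1) code point indexing can land you in the middle of a user-visible character, and `unicodedata.combining()` is the "scan backwards" test for it.

```python
import unicodedata

s = "e\u0301"   # renders as 'é': 'e' plus a combining acute accent
print(len(s))   # 2 code points, but 1 user-visible character

# Indexing lands you on the bare accent, mid-"character":
mark = s[1]
assert unicodedata.combining(mark) != 0   # non-zero combining class

# So even with O(1) indexing, you must scan backwards past any
# combining marks to find the start of the character you're in.
assert unicodedata.combining(s[0]) == 0   # 'e' starts a character
```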

pseudorandom name
May 6, 2007

there's also fun things like the fi ligature which is a single code point, but in a correctly implemented Unicode text entry field will behave as two characters with respect to Backspace/Delete or cursor movement

Malcolm XML
Aug 8, 2009

I always knew it would end like this.
normalized forms oooo

gonadic io
Feb 16, 2011

>>=

pseudorandom name posted:

there's also fun things like the fi ligature which is a single code point, but in a correctly implemented Unicode text entry field will behave as two characters with respect to Backspace/Delete or cursor movement

I really like this kind of behaviour. I have my vs code set up with ligatures that are still the ascii chars in the file, and semantically when editing, but visually they become the unicode ligatures

Carthag Tuek
Oct 15, 2005

Tider skal komme,
tider skal henrulle,
slægt skal følge slægters gang



gonadic io posted:

I really like this kind of behaviour. I have my vs code set up with ligatures that are still the ascii chars in the file, and semantically when editing, but visually they become the unicode ligatures

hows it do if you copy to another application? just wondering whats going on in the frontend

cinci zoo sniper
Mar 15, 2013

Powaqoatse posted:

hows it do if you copy to another application? just wondering whats going on in the frontend

sounds like he is just using fira code or equivalent

gonadic io
Feb 16, 2011

>>=

cinci zoo sniper posted:

sounds like he is just using fira code or equivalent

That's it yeah. It's purely a visual thing, all selection and copy pasting and editing behave as if they were separate characters

cinci zoo sniper
Mar 15, 2013

yea i use it too. the www ligature is extremely annoying, but when it finally gets to me ill just look up that russian fira code clone i saw the other day that has a bunch of customisation options

Workaday Wizard
Oct 23, 2009

by Pragmatica

cinci zoo sniper posted:

yea i use it too. the www ligature is extremely annoying, but when it finally gets to me ill just look up that russian fira code clone i saw the other day that has a bunch of customisation options

i couldve sworn the main fira site contains variants without the www ligature

Carthag Tuek
Oct 15, 2005

Tider skal komme,
tider skal henrulle,
slægt skal følge slægters gang



oh neat

Carthag Tuek
Oct 15, 2005

Tider skal komme,
tider skal henrulle,
slægt skal følge slægters gang



also you can open the ttf/otf files in like fontforge and gently caress with the feature config to disable specific ligatures if you want

MononcQc
May 29, 2007

Here's an attempt at a quick run-through of vocabulary and consequences. In ASCII and latin-1 (ISO-8859-N) you mostly can deal with the following words, which are all synonymous:
  • character
  • letter
  • symbol
In some variants, you also have to add the word 'diacritic' or 'accent', which describes a mark that modifies a character linguistically:

a + ` = à

Fortunately, in latin-1, most of these naughty diacritics have been bundled into precomposed characters. In French, for example, this final 'letter' can be represented by a single code (224).

There are, however, complications coming from that. One of them is 'collation' (the sorting order of letters). For example, 'a' and 'à' in French ought to sort in the same portion of the alphabet (before 'b'), but by a naive byte comparison 'à' ends up sorting after 'z'.

In Danish, A is the first letter of the alphabet, but Å is the last. Å is also seen as a ligature of Aa, so 'Aa' sorts like Å rather than as the two letters 'A' and 'a' one after the other. Swedish has different diacritics with a different order: Å, Ä, Ö.

Enter UNICODE. To make a long story short (and with gross oversimplification), we have the following terms in the vocabulary:
  • character: smallest representable unit in a language, in the abstract. '`' is a character, so is 'a', and so is 'à'
  • glyph: the visual representation of a character. Think of it as a character from the point of view of the font or typeface designer. For example, the same glyph may be used for the capital letter 'pi' and the mathematical symbol for a product: ∏. Similarly, capital 'Sigma' and the mathematical 'sum' may have different character representation, but the same ∑ glyph.
  • letter: an element of an alphabet
  • codepoint: a given value in the unicode space. There's a big table with a crapload of characters in it, and every character is assigned a codepoint as a unique identifier.
  • code unit: a specific encoded form of a given code point. This refers to bits, not just the big table. The same code point may map to different code units in UTF-8, UTF-16, and UTF-32, which are 3 'encodings' of unicode.
  • grapheme: what the user thinks of a 'character'
  • grapheme cluster: what you want to think of as a 'character' for your user's sake. Basically, 'a' and '`' can be two graphemes, but if I combine them together as 'à', I want to be able to say that a single 'delete' key press will remove both the '`' and the 'a' at once from my text, and not be left with one or the other.
The big fun bit is that unicode takes all these really lovely complicated linguistic things and specifies how they should be handled.

Like, what makes two strings equal? The french 'é' can be represented both as a single é or as e + ´. It would be good, when you deal with say JSON or maybe my own name, that you don't end up having 'Frédéric' as 4 different people depending on which form is used. In any case, these equivalence rules are specified in the normalization forms (http://unicode.org/reports/tr15/), and there's a bunch of similar reports for the other ambiguities. If you deal with string length, you can have it in bytes, in code points, in code units, or in grapheme clusters.
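The é example, run through Python's unicodedata module (a sketch of TR15's NFC/NFD forms):

```python
import unicodedata

composed = "\u00e9"      # 'é' as one precomposed code point
decomposed = "e\u0301"   # 'e' plus combining acute accent

# They render identically, but compare unequal as raw code points:
print(composed == decomposed)   # False

# Normalizing both sides to the same form fixes the comparison:
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# This is what keeps 'Frédéric' from becoming 4 different people.
```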

As someone has mentioned, a unicode family emoji (as a single grapheme) can be built of 4 emoji people joined together, and each of these emoji people may require multiple bytes to be represented. But under a proper unicode library, they can be consumed as one user-readable entity (grapheme cluster) transparently.

It's a good thing to make a dedicated interface for unicode strings, because a dedicated type really does handle strings better. The real horror is having treated text as rando bytes for years without ever giving it the proper abstraction it deserves. Strings should probably not have been this PL-side 'blob' data type that could handle byte streams and text and all of that at once. You could do it easily under ASCII and latin-1, and it made sense at a lower level, but it turns out that doesn't carry over to real-world languages.

IMO python3 suffered a whole lot because, as far as I understand: 1. they took away byte strings at first, so everyone would be forced to use unicode even when it made no sense (they just flipped the bad decision around); 2. it took a long while to bring them back; 3. everyone had to learn the conceptual difference between byte strings and unicode at once.
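The conceptual split in point 3 is visible right at the REPL; a sketch of how Python 3 forces the distinction:

```python
# bytes and str are disjoint types in Python 3: mixing them is an
# error, rather than an implicit coercion as in Python 2.
try:
    b"abc" + "abc"
except TypeError as exc:
    print("refused:", exc)

# Crossing the boundary is always an explicit encode/decode:
data = "Frédéric".encode("utf-8")   # str -> bytes
assert data.decode("utf-8") == "Frédéric"
```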

mystes
May 31, 2006

The people who were happy with python 2 are English speakers who were blithely ignoring Unicode issues and screwing things up for people using other languages. Its defaults are not acceptable in 2017.

On the other hand, dynamic typing does seem to make this more confusing than necessary for people who really do want to deal with bytes.

Notorious b.s.d.
Jan 25, 2003

by Reene

MononcQc posted:

It's a good thing to make a dedicated interface for unicode strings, because a dedicated type really does handle strings better. The real horror is having treated text as rando bytes for years without ever giving it the proper abstraction it deserves. Strings should probably not have been this PL-side 'blob' data type that could handle byte streams and text and all of that at once. You could do it easily under ASCII and latin-1, and it made sense at a lower level, but it turns out that doesn't carry over to real-world languages.

importantly, filenames on every extant platform are bags of bytes, not fully-fledged unicode strings

if you are going to handle unicode correctly, you have to have clear interfaces to produce and consume those opaque bags of bytes, or else you will fail

naturally, english-speaking programmers really hate all this extra "cruft" and beg for their old dumb C strings
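For reference, the escape hatch Python 3 eventually settled on for those opaque bags of bytes is the surrogateescape error handler (also used by os.fsencode()/os.fsdecode() on POSIX); a sketch with a hypothetical mangled filename:

```python
# A filename containing a byte that is not valid UTF-8:
hosed = b"report-\xff.txt"

# surrogateescape smuggles the undecodable byte through str as a
# lone surrogate instead of raising UnicodeDecodeError...
as_str = hosed.decode("utf-8", "surrogateescape")
print(repr(as_str))   # 'report-\udcff.txt'

# ...and the original bytes round-trip exactly on the way back out.
assert as_str.encode("utf-8", "surrogateescape") == hosed
```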


MononcQc posted:

IMO python3 suffered a whole lot because, as far as I understand: 1. they took away byte strings at first, so everyone would be forced to use unicode even when it made no sense (they just flipped the bad decision around); 2. it took a long while to bring them back; 3. everyone had to learn the conceptual difference between byte strings and unicode at once.

worst of all, python 3 had almost nothing to offer users

as a person who just wants to bang out a quick data processor in numpy, there was 0 reason to use python 3 over 2

MrMoo
Sep 14, 2000

Notorious b.s.d. posted:

naturally, english-speaking programmers really hate all this extra "cruft" and beg for their old dumb C strings

Also, Chinese-speaking programmers hate the overheads of Unicode compared to random locale encoding.

What is amazing is that Traditional Chinese and Simplified Chinese are two separate families of locale encodings that are incompatible and carry no signatures. Taiwan and Hong Kong are not isolated masses of people, especially the latter: their primary income is business with China. Foxconn is a Taiwanese company manufacturing tech equipment in China, directly in the IT world. Tencent, the $500 billion tech company, still updates its email software Foxmail, which only supports ASCII and some lovely subset of GB18030 that is Simplified-only. Politics and arrogance are the norm, so where there are US programmers loving ASCII, there are more Chinese programmers who think everyone should kowtow to GB18030.

Malcolm XML
Aug 8, 2009

I always knew it would end like this.

MrMoo posted:

Also, Chinese-speaking programmers hate the overheads of Unicode compared to random locale encoding.

What is amazing is that Traditional Chinese and Simplified Chinese are two separate families of locale encodings that are incompatible and carry no signatures. Taiwan and Hong Kong are not isolated masses of people, especially the latter: their primary income is business with China. Foxconn is a Taiwanese company manufacturing tech equipment in China, directly in the IT world. Tencent, the $500 billion tech company, still updates its email software Foxmail, which only supports ASCII and some lovely subset of GB18030 that is Simplified-only. Politics and arrogance are the norm, so where there are US programmers loving ASCII, there are more Chinese programmers who think everyone should kowtow to GB18030.

Also Unicode is basically institutionally racist, demanding Han unification but not the equivalent for western languages. This is because they desperately stuck to the idea that 16 bits would be enough for everything and refused to change despite now having 32 bits, and the Unicode committees are run by idiot white western nerds.


That said, their Indic script support is surprisingly ok (not great, but ok), because the big pain in those scripts comes from the typesetting engines, owing to the prevalence of ligatures

tef
May 30, 2004

-> some l-system crap ->

Suspicious Dish posted:

as someone that does a lot of reverse engineering, network protocols, and in a lot of cases, deals with completely hosed filenames, python3 is a lot worse of a language for me.

tbh python2 was pretty lovely at it, but eh, python3 really hosed those use cases

i found bytearrays fixed a bunch of stuff, and at least they put bytestring formatting back in
