|
cinci zoo sniper posted:i see. i was aware of the tool but never had to use it. sounds like a really stupid thing to gently caress up if it was a first party thing, which i assume it is it was a really stupid thing to gently caress up. also python decided that all filenames are unicode, full stop. on posix this means filenames are interpreted according to the locale, and command line arguments too. yes, POSIX is hosed up when it comes to character encodings, but decreeing otherwise doesn't make things better, it just makes them worse, since all the problems of POSIX are still there. my favorite part of this is that for a long time the built-in zipfile module also translated filenames based on the system locale, meaning that the same zip file could have different filenames on different systems with different encoding settings, and there was no way to access the raw bytes of the filename. i think they eventually added that. the same thing happened with command line arguments.
|
# ? Nov 26, 2017 01:18 |
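The raw-bytes problem described above is what PEP 383's surrogateescape error handler eventually papered over; a minimal sketch, assuming CPython 3.1 or later:

```python
# Undecodable bytes in a filename become lone surrogates (PEP 383),
# so the str form round-trips back to the exact original bytes.
raw = b"caf\xe9.txt"                            # latin-1 bytes, invalid UTF-8
name = raw.decode("utf-8", "surrogateescape")
assert name == "caf\udce9.txt"                  # \xe9 -> lone surrogate U+DCE9
assert name.encode("utf-8", "surrogateescape") == raw
```

The resulting string is not really text (it contains unpaired surrogates), but it survives a round trip, which is the best Python can do without exposing the raw bytes.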
|
|
|
also file io is ridiculous. the "b" flag to open in python3 drastically changes the return value from a unicode string to a bytestring (a completely different data structure, more like an array of integers than a string), and there is no way to test what kind of file you were handed other than to read 0 bytes and test the return value. in previous versions all the flag did was change whether '\n' was translated to '\r\n' on windows when writing files, iirc. i think if you pass a "bytestring file" to several built-in python modules they crash because they assume the return value of file.read is a unicode string. this goes to rjmccall's point about static typing: there's a murky area where there are now two different file types, one whose api uses unicode strings and one that uses bytestrings, and no way to tell them apart. there really should actually be two file types. in practice people stopped using bytestring files.
|
# ? Nov 26, 2017 01:22 |
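The read-0-bytes trick the post describes does work; the io module's ABCs are the other, less obvious way to tell the two apart. A sketch, using in-memory streams in place of open():

```python
import io

text = io.StringIO("hello")    # stands in for open(path, "r")
blob = io.BytesIO(b"hello")    # stands in for open(path, "rb")

# the read(0) probe: same call, different result types
assert isinstance(text.read(0), str)
assert isinstance(blob.read(0), bytes)

# the ABC check: text streams subclass io.TextIOBase, binary ones do not
assert isinstance(text, io.TextIOBase)
assert not isinstance(blob, io.TextIOBase)
```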
|
cinci zoo sniper posted:i completely dont get the byte-string and text-string difference, or its implications even if its my naive assumption that one is array of bytes, and other is array of chars 1 char ≥ 1 byte, and most importantly it's variable. When 1 char = 1 byte you can use tools like memcmp, memcpy, and trivial offset references for the nth character. This is frequently why UTF-8 is used as a storage and message encoding, because it saves space, but UTF-16 or UTF-32 is used within UI applications, because characters become fixed width and offset references become trivial again. In Python a byte-string picks up your locale encoding, which is otherwise only detectable by complicated statistical analysis on the text; browsers like Firefox and Chrome do this when presented with non-Unicode text, and still frequently fail at it. MrMoo fucked around with this message at 01:28 on Nov 26, 2017
# ? Nov 26, 2017 01:23 |
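The "1 char ≥ 1 byte, and variable" point in concrete terms, in Python, where str indexes code points and bytes indexes raw bytes:

```python
s = "na\u00efve"            # "naïve": five code points
b = s.encode("utf-8")       # its UTF-8 storage form

assert len(s) == 5          # five characters...
assert len(b) == 6          # ...but six bytes: 'ï' takes two in UTF-8
assert s[2] == "\u00ef"     # O(1) indexing by code point
assert b[2] == 0xC3         # byte indexing yields an int, mid-character here
```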
oic, ty. this "b" input thing about bytestrings has me worried now, i should run quickly through the py3 io docs at some point, since i ported one of my current projects that deals with io a lot to py3, and while it seemingly works identically to how it did before, i havent spent too much time testing it thoroughly yet MrMoo posted:1 char ≥ 1 byte, and most importantly variable. When 1 char = 1 byte you can use tools like memcmp, memcpy, trivial offset references for nth characters.
|
|
# ? Nov 26, 2017 01:38 |
|
MrMoo posted:but UTF-16 or UTF-32 is used within UI applications because characters become fixed width and offset references become trivial again.
|
# ? Nov 26, 2017 01:45 |
|
MrMoo posted:It is frequently why UTF-8 is used a storage and message encoding, because it saves space, but UTF-16 or UTF-32 is used within UI applications because characters become fixed width and offset references become trivial again. my dude, let me introduce you to surrogate pairs
|
# ? Nov 26, 2017 01:45 |
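Surrogate pairs, concretely: any code point above U+FFFF costs two 16-bit code units in UTF-16, which is exactly why it is not fixed width.

```python
man = "\U0001F468"                   # MAN emoji, U+1F468, outside the BMP
units = man.encode("utf-16-be")

assert len(man) == 1                 # one code point
assert len(units) == 4               # but two 16-bit code units in UTF-16
assert units == b"\xd8\x3d\xdc\x68"  # high surrogate D83D + low surrogate DC68
```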
|
More to the point, you almost never actually want to get the Nth character of a text string. The things you want to do with an array of bytes don't overlap all that much with the things you want to do with a piece of text, even though the text is commonly stored as an array of bytes.
|
# ? Nov 26, 2017 01:52 |
|
Fiedler posted:UTF-16 is not fixed-width. I meant UCS-2, but that doesn't stop people calling it UTF-16
|
# ? Nov 26, 2017 01:58 |
|
UCS-2 and UCS-4/UTF-32 aren't fixed width either
|
# ? Nov 26, 2017 02:03 |
|
MrMoo posted:I meant UCS-2, but that doesn't stop people calling it UTF-16 let us tell u about something called combining characters / 👨👨👦👦
|
# ? Nov 26, 2017 02:18 |
|
pseudorandom name posted:UCS-2 and UCS-4/UTF-32 aren't fixed width either uhhhh, I thought the whole point of UTF-32 was that it was fixed-width
|
# ? Nov 26, 2017 02:20 |
|
pseudorandom name posted:UCS-2 and UCS-4/UTF-32 aren't fixed width either wait, what?
|
# ? Nov 26, 2017 02:23 |
|
ComradeCosmobot posted:let us tell u about something called combining characters please do / 👩🏾💻
|
# ? Nov 26, 2017 02:24 |
|
Fixed width for code points not characters, ugh. Way too much encoding chat already
|
# ? Nov 26, 2017 02:33 |
|
pseudorandom name posted:UCS-2 and UCS-4/UTF-32 aren't fixed width either UCS-2 characters are fixed at width 16 bits, UCS-4/UTF-32 are fixed at 32 bits. Is this wrong?
|
# ? Nov 26, 2017 02:41 |
|
https://mathias.gaunard.com/unicode/doc/html/unicode/introduction_to_unicode.html no really, if you're confused by that, click that link. I know it says introduction to unicode but it isn't long and it's really just about exactly this point
|
# ? Nov 26, 2017 02:44 |
|
Easiest with the example from ComradeCosmobot: 👨👨👦👦 (Man, Man, Boy, Boy) is a single Unicode character composed of four characters joined together, seven code points:

Codepoint #1 = Man 👨 = U+1F468
Codepoint #2 = Zero-width joiner = U+200D
Codepoint #3 = Man 👨 = U+1F468
Codepoint #4 = Zero-width joiner = U+200D
Codepoint #5 = Boy 👦 = U+1F466
Codepoint #6 = Zero-width joiner = U+200D
Codepoint #7 = Boy 👦 = U+1F466

Where U+XXXX is the raw Unicode code point, which you can usually reference in a plang as "\U0001F468". The raw bytes on macOS look like this, 26 bytes long including the trailing newline:

$ cat unicode.txt
👨👨👦👦
$ hexdump unicode.txt
0000000 f0 9f 91 a8 e2 80 8d f0 9f 91 a8 e2 80 8d f0 9f
0000010 91 a6 e2 80 8d f0 9f 91 a6 0a
000001a

i.e. Unicode is quite complicated and characters can even have multiple meanings due to special modifying codepoints. MrMoo fucked around with this message at 03:32 on Nov 26, 2017
# ? Nov 26, 2017 02:53 |
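The same ZWJ sequence counted from Python (the hexdump's final 0a is the trailing newline, hence 26 bytes there versus 25 here):

```python
# MAN ZWJ MAN ZWJ BOY ZWJ BOY -- one glyph on screen, seven code points
family = "\U0001F468\u200D\U0001F468\u200D\U0001F466\u200D\U0001F466"

assert len(family) == 7                       # code points
assert len(family.encode("utf-8")) == 25      # 4+3+4+3+4+3+4 bytes
assert len(family.encode("utf-32-be")) == 28  # 7 * 4 bytes, even in "fixed" UTF-32
```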
|
I don't know what that is meant to be an example of.
|
# ? Nov 26, 2017 03:10 |
|
Doom Mathematic posted:I don't know what that is meant to be an example of. UTF-32 isn’t a fixed width encoding.
|
# ? Nov 26, 2017 03:48 |
|
MrMoo posted:Easiest with the example from ComradeCosmobot: the best part about that example is that i lied: the example i gave technically has no combining characters and, depending on which emoji zwj sequences your standard library is aware of (if any; they technically aren’t required for a compliant implementation!), might count as 7 characters or 1 character (a true combining character would involve a codepoint that has a non-zero combining class) ComradeCosmobot fucked around with this message at 04:13 on Nov 26, 2017 |
# ? Nov 26, 2017 04:11 |
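The combining-class distinction is checkable with the stdlib: unicodedata.combining returns 0 for the ZWJ, but a non-zero class for a true combining character.

```python
import unicodedata

assert unicodedata.combining("\u200D") == 0    # ZERO WIDTH JOINER: not combining
assert unicodedata.combining("\u0301") == 230  # COMBINING ACUTE ACCENT: class 230
assert unicodedata.combining("a") == 0         # ordinary letters are class 0
```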
|
pseudorandom name posted:UTF-32 isn’t a fixed width encoding. 4 bytes is enough to represent every code point that has been and can be defined by unicode as it currently stands. what am i missing?
|
# ? Nov 26, 2017 05:31 |
|
redleader posted:4 bytes is enough to represent every code point that has been and can be defined by unicode as it currently stands. what am i missing? they're saying it's not a fixed-width encoding of characters; you're saying it's a fixed-width encoding of code points. I'm not knowledgeable enough to comment on whether this is too pedantic a distinction to make in this context (are the various encoding standards defined by code point or by character, and does it matter?), but I can comfortably say that if you're being too pedantic in a discussion about loving Unicode then maybe it's time to take a good look in the mirror and ask yourself who you wanna be in this world
|
# ? Nov 26, 2017 06:11 |
|
redleader posted:4 bytes is enough to represent every code point that has been and can be defined by unicode as it currently stands. what am i missing? characters can be made up of multiple code points combined (see the post above yours), so just because you can jump up 4 bytes to the next code point doesn't mean you've advanced one character, which is probably what your user wanted to do with the code you wrote, since that's what she saw on the screen
|
# ? Nov 26, 2017 06:11 |
|
the point is that while you can O(1) index to a code point in those encodings, there is no actual semantic operation on strings that takes advantage of that, because you have to be aware of the possibility that you've indexed into the middle of a multi-code-point character. you could scan backwards to see if you have, but you can do that with utf-8 and utf-16 too. there are simpler examples even in ucs-2, like the combining hangul or (more familiar to westerners) the combining diacritics
|
# ? Nov 26, 2017 07:45 |
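A BMP-only example of landing mid-"character", using one of those combining diacritics:

```python
import unicodedata

s = "cafe\u0301"          # 'e' followed by COMBINING ACUTE ACCENT
assert len(s) == 5        # five code points, four user-visible characters
assert s[4] == "\u0301"   # O(1) indexing lands on the bare accent
assert unicodedata.normalize("NFC", s) == "caf\u00e9"   # composed form
assert len(unicodedata.normalize("NFC", s)) == 4        # now four code points
```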
|
there's also fun things like the fi ligature which is a single code point, but in a correctly implemented Unicode text entry field will behave as two characters with respect to Backspace/Delete or cursor movement
|
# ? Nov 26, 2017 09:09 |
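The fi ligature really is one code point; compatibility normalization (NFKC) is what decomposes it back into 'f' + 'i':

```python
import unicodedata

lig = "\uFB01"                                     # LATIN SMALL LIGATURE FI
assert len(lig) == 1                               # a single code point
assert unicodedata.normalize("NFKC", lig) == "fi"  # two characters after NFKC
```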
|
normalized forms oooo
|
# ? Nov 26, 2017 12:45 |
|
pseudorandom name posted:there's also fun things like the fi ligature which is a single code point, but in a correctly implemented Unicode text entry field will behave as two characters with respect to Backspace/Delete or cursor movement I really like this kind of behaviour. I have my vs code set up with combining ligatures that are still the ascii chars in the file and semantically when editing, but visually they become the unicode ligatures
|
# ? Nov 26, 2017 12:57 |
|
gonadic io posted:I really like this kind of behaviour, I have my vs code set up with combining ligatures that are still the ascii chars in the file and semantically when editing, but the visually they become the unicode ligatures hows it do if you copy to another application? just wondering whats going on in the frontend
|
# ? Nov 26, 2017 13:11 |
Powaqoatse posted:hows it do if you copy to another application? just wondering whats going on in the frontend sounds like he is just using fira code or equivalent
|
|
# ? Nov 26, 2017 13:19 |
|
cinci zoo sniper posted:sounds like he is just using fira code or equivalent That's it yeah. It's purely a visual thing, all selection and copy pasting and editing behave as if they were separate characters
|
# ? Nov 26, 2017 13:32 |
yea i use it too. the www ligature is extremely annoying, but when it finally gets to me ill just look up that russian fira code clone i saw the other day that has a bunch of customisation options
|
|
# ? Nov 26, 2017 13:39 |
|
cinci zoo sniper posted:yea i use too. www ligature is extremely annoying but when it finally gets to me ill just look up that russia fira code clone i saw the other day that has bunch of customisation options i could've sworn the main fira site contains variants without the www ligature
|
# ? Nov 26, 2017 13:48 |
|
oh neat
|
# ? Nov 26, 2017 15:21 |
|
also you can open the ttf/otf files in like fontforge and gently caress with the feature config to disable specific ligatures if you want
|
# ? Nov 26, 2017 15:25 |
|
Here's an attempt at a quick run-through of vocabulary and consequences. In ASCII and latin-1 (ISO-8859-N) you can mostly treat 'byte', 'character', and 'code point' as synonymous. The complications start with diacritics, e.g. a + ` = à. Fortunately, in latin-1, most of these naughty diacritics have been bundled into specific characters. In French, for example, this final 'letter' can be represented under a single code (224). There are however complications coming from that. One of them is 'collation' (the sorting order of letters). For example, a and à in French ought to sort in the same portion of the alphabet (before 'b'), but by default they end up sorting after 'z'. In Danish, A is the first letter of the alphabet, but Å is last. Also, Å is seen as a ligature of Aa; Aa is sorted like Å rather than as the two letters 'Aa' one after the other. Swedish has different diacritics with a different order: Å, Ä, Ö.
Enter UNICODE. To make a long story short (and with gross oversimplification), the vocabulary now splits into distinct terms: bytes, code units, code points, and grapheme clusters, and new questions come with them.
Like, what makes two strings equal? The French 'é' can be represented both as a single é or as e+´. It would be good, when you deal with say JSON or maybe my own name, that you don't end up having 'Frédéric' as 4 different people depending on which form is used. In any case, these encoding rules are specified in normalization forms (http://unicode.org/reports/tr15/). There's a bunch of similar reports for all the ambiguities. If you deal with string length, you can have it in code points, in bytes, in code units, or in grapheme clusters. As someone has mentioned, a unicode family emoji (a single grapheme) can be built of 4 emoji people joined together, and each of those emoji people may require multiple bytes to be represented. But under a proper unicode library, they can be consumed as one user-readable entity (a grapheme cluster) transparently. It's a good thing to make a dedicated interface for unicode strings, because a dedicated type really does handle strings better. The real horror is having treated text as rando bytes for years, without ever giving it the proper abstraction it deserved. Strings should probably not have been this PL-side 'blob' data type that could handle byte streams and strings and all of that at once. You could do it easily under ASCII and latin-1, and it made sense at a lower level, but it turns out that wouldn't carry over to real-world languages. IMO python3 suffered a whole lot because, as far as I understand: 1. they took away byte strings at first, so everyone would be forced to use unicode even when it made no sense (they just flipped the bad decision around); 2. it took a long while to bring them back; 3. everyone had to learn the conceptual difference between byte strings and unicode at once.
|
# ? Nov 26, 2017 16:08 |
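The é equality problem in code, via unicodedata.normalize (the TR15 NFC/NFD forms mentioned above):

```python
import unicodedata

composed = "\u00e9"      # é as one code point (NFC form)
decomposed = "e\u0301"   # é as 'e' + COMBINING ACUTE ACCENT (NFD form)

assert composed != decomposed  # naive comparison: "different strings"
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
assert "Fre\u0301de\u0301ric" != "Fr\u00e9d\u00e9ric"  # same name, two spellings
```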
|
The people who were happy with python 2 are English speakers who were blithely ignoring Unicode issues and screwing things up for people using other languages. Its defaults are not acceptable in 2017. On the other hand, dynamic typing does seem to make this more confusing than necessary for people who really do want to deal with bytes.
|
# ? Nov 26, 2017 16:36 |
|
MononcQc posted:It's a good thing to make a dedicated interface for unicode strings because it really handles strings better as a dedicated type. The real horror is having treated text as rando bytes for years, without ever giving them the proper abstraction it deserved. Strings should probably not have been this PL-side 'blob' data type that could handle byte streams and strings and all of that at once. You could do it easy under ASCII and latin-1 and it made sense at a lower level, but turns out that wouldn't carry out to real world languages. importantly, filenames on every extant platform are bags of bytes, not fully-fledged unicode strings if you are going to handle unicode correctly, you have to have clear interfaces to produce and consume those opaque bags of bytes, or else you will fail naturally, english-speaking programmers really hate all this extra "cruft" and beg for their old dumb C strings MononcQc posted:IMO python3 suffered a whole lot because as far as I understand 1. they took away byte strings at first so everyone would be forced to use unicode when it made no sense (they just flipped the bad decision around) 2. it took a long while to bring back 3. everyone had to learn the conceptual difference between byte strings and unicode at once. worst of all, python 3 had almost nothing to offer users as a person who just wants to bang out a quick data processor in numpy, there was 0 reason to use python 3 over 2
|
# ? Nov 26, 2017 17:03 |
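Python's eventual "clear interface" for the bag-of-bytes boundary is os.fsencode/os.fsdecode, which apply the filesystem encoding with the surrogateescape handler so any filename bytes survive the round trip; a sketch, assuming a POSIX filesystem encoding:

```python
import os

raw = b"report-\xff.log"      # \xff is not valid UTF-8
s = os.fsdecode(raw)          # bytes -> str, undecodable bytes -> surrogates
assert os.fsencode(s) == raw  # str -> bytes, exactly as they started

# most os functions also accept bytes directly and then return bytes,
# e.g. os.listdir(b".") yields bytes names, sidestepping decoding entirely
```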
|
Notorious b.s.d. posted:naturally, english-speaking programmers really hate all this extra "cruft" and beg for their old dumb C strings Also, Chinese-speaking programmers hate the overheads of Unicode compared to their random locale encodings. What is amazing is that Traditional Chinese and Simplified Chinese are two separate families of locale encoding that are incompatible and have no signatures. Taiwan and Hong Kong are not isolated masses of people, especially the latter; their primary income is business with China. Foxconn is a Taiwanese company manufacturing tech equipment in China, directly in the IT world. Tencent, the $500 billion tech company, still updates their email software Foxmail, which only supports ASCII and some lovely subset of GB18030 that is Simplified-only. Politics and arrogance are the norm, so where there are US programmers loving ASCII there are more Chinese programmers who think everyone should kowtow to GB18030.
|
# ? Nov 26, 2017 17:25 |
|
MrMoo posted:Also, Chinese-speaking programmers hate the overheads of Unicode compared to random locale encoding. Also Unicode is basically institutionally racist by demanding Han unification but not the equivalent for western languages. This is because they desperately stuck to the idea that 16 bits would be enough for everything and refuse to change despite now having 32 bits, and the Unicode committees are run by idiot white western nerds. That said, their Indic script support is surprisingly ok, not great, but ok, because the big pain in those comes from the typesetting engines b/c of the prevalence of ligatures
|
# ? Nov 26, 2017 17:55 |
|
|
|
Suspicious Dish posted:as someone that does a lot of reverse engineering, network protocols, and in a lot of cases, deals with completely hosed filenames, python3 is a lot worse of a language for me. tbh python2 was pretty lovely at it but eh python3 really hosed those use cases. i found bytearrays fixed a bunch of stuff, but at least they put bytestring formatting back in
|
# ? Nov 26, 2017 18:24 |