Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

cinci zoo sniper posted:

i see. i was aware of the tool but never had to use it. sounds like a really stupid thing to gently caress up if it was a first party thing, which i assume it is

re: your other post, what problems do you face with filenames in py3 that weren't there in py2?

it was a really stupid thing to gently caress up

also python decided that all filenames are unicode, full stop. on posix this means filenames (and command line arguments too) get interpreted according to the current locale. yes, POSIX is hosed up when it comes to character encodings, but decreeing unicode by law doesn't make things better, it makes them worse, since all the problems of POSIX are still there.

my favorite part of this is that for a long time, the built-in zipfile module also translated filenames based on the system locale, meaning the same zip file could have different filenames on different systems with different encoding settings. and there was no way to access the raw bytes of the filename. i think they eventually added that.

the same thing happened with command line arguments too.
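A quick sketch of the filename situation being described, using nothing but stock Python 3 library behavior: `os.listdir()` gives you the decoded unicode view for a `str` path and the raw bytes view for a `bytes` path.

```python
import os
import tempfile

# Python 3 exposes both views of a filename: a str argument to
# os.listdir() yields str names (decoded with the filesystem
# encoding), while a bytes argument yields the raw bytes names.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "hello.txt"), "w").close()

    as_text = os.listdir(d)                 # ['hello.txt'] as str
    as_bytes = os.listdir(os.fsencode(d))   # [b'hello.txt'] as bytes

    print(type(as_text[0]))    # <class 'str'>
    print(type(as_bytes[0]))   # <class 'bytes'>
```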

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
also file io is ridiculous. the "b" flag to open() in python3 changes the return type of read() from unicode string to bytestring (a completely different data structure, more like an array of integers than a string), and there is no way to test what kind of file you were handed other than to read 0 bytes and check the type of the return value. in python 2, iirc, all the flag did was control whether '\n' was translated to '\r\n' on windows when writing files.

i think if you pass a "bytestring file" to several built-in python modules, they crash, because they assume the return value of file.read() is a unicode string.

this goes into rjmccall's point about static typing: there's this murky area where now there are two different file types, one whose api uses unicode strings and one that uses bytestrings, and there's no way to tell them apart. there really should be two distinct file types.

in practice people stopped using bytestring files.
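For the record, a minimal sketch of the type flip described above (stock Python 3 behavior, no thread-specific code), including the "read 0 bytes" probe:

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w") as f:
    f.write("hello")

# Text mode: read() returns str.
with open(path, "r") as f:
    assert isinstance(f.read(), str)

# Binary mode: read() returns bytes, a different type entirely.
with open(path, "rb") as f:
    assert isinstance(f.read(), bytes)

# The "read 0 bytes" trick: probe the type without consuming data.
with open(path, "rb") as f:
    print(type(f.read(0)))   # <class 'bytes'>

os.remove(path)
```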

MrMoo
Sep 14, 2000

cinci zoo sniper posted:

i completely dont get the byte-string and text-string difference, or its implications, even though my naive assumption is that one is an array of bytes and the other an array of chars

1 char ≥ 1 byte, and most importantly the width is variable. When 1 char = 1 byte you can use tools like memcmp, memcpy, and trivial offset references for nth characters.

It is frequently why UTF-8 is used as a storage and message encoding, because it saves space, but UTF-16 or UTF-32 is used within UI applications because characters become fixed width and offset references become trivial again.

In Python a byte-string picks up your locale encoding, which is otherwise only detectable by complicated statistical analysis of the text; browsers like Firefox and Chrome do this when presented with non-Unicode text, and still frequently fail at it.

MrMoo fucked around with this message at 01:28 on Nov 26, 2017
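To illustrate MrMoo's point, a minimal sketch: the same single byte decodes to three different characters under three legacy encodings, which is exactly why charset guessing needs statistical analysis rather than inspection.

```python
# One byte, three different "characters", depending on which legacy
# encoding you assume the text is in.
raw = b"\xe0"

print(raw.decode("latin-1"))   # 'à' (Western European)
print(raw.decode("cp1251"))    # 'а' (Cyrillic small a)
print(raw.decode("cp437"))     # 'α' (Greek alpha, old DOS code page)
```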

cinci zoo sniper
Mar 15, 2013

oic, ty. this "b" input thing about bytestrings has me worried now. i should run quickly through the py3 io docs at some point, since i ported one of my current projects that deals with io a lot to py3, and while it seemingly works identically to how it did before, i havent spent too much time testing it thoroughly yet

MrMoo posted:

1 char ≥ 1 byte, and most importantly the width is variable. When 1 char = 1 byte you can use tools like memcmp, memcpy, and trivial offset references for nth characters.
oh, right :doh:

Fiedler
Jun 29, 2002

I, for one, welcome our new mouse overlords.

MrMoo posted:

but UTF-16 or UTF-32 is used within UI applications because characters become fixed width and offset references become trivial again.
UTF-16 is not fixed-width.

invlwhen
Jul 28, 2012

please do your best

MrMoo posted:

It is frequently why UTF-8 is used as a storage and message encoding, because it saves space, but UTF-16 or UTF-32 is used within UI applications because characters become fixed width and offset references become trivial again.

my dude, let me introduce you to surrogate pairs

Doom Mathematic
Sep 2, 2008
More to the point, you almost never actually want to get the Nth character of a text string. The things you want to do with an array of bytes don't overlap all that much with the things you want to do with a piece of text, even though the text is commonly stored as an array of bytes.

MrMoo
Sep 14, 2000

Fiedler posted:

UTF-16 is not fixed-width.

I meant UCS-2, but that doesn't stop people calling it UTF-16 :lol:

pseudorandom name
May 6, 2007

UCS-2 and UCS-4/UTF-32 aren't fixed width either

ComradeCosmobot
Dec 4, 2004

USPOL July

MrMoo posted:

I meant UCS-2, but that doesn't stop people calling it UTF-16 :lol:

let us tell u about something called combining characters
/
👨‍👨‍👦‍👦

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

pseudorandom name posted:

UCS-2 and UCS-4/UTF-32 aren't fixed width either

uhhhh, I thought the whole point of UTF-32 was that it was fixed-width

redleader
Aug 18, 2005

Engage according to operational parameters

pseudorandom name posted:

UCS-2 and UCS-4/UTF-32 aren't fixed width either

wait, what?

pseudorandom name
May 6, 2007

ComradeCosmobot posted:

let us tell u about something called combining characters
/
👨‍👨‍👦‍👦

please do
/
👩🏾‍💻

MrMoo
Sep 14, 2000

:negative: Fixed width for code points not characters, ugh. Way too much encoding chat already :gonk:

Doom Mathematic
Sep 2, 2008

pseudorandom name posted:

UCS-2 and UCS-4/UTF-32 aren't fixed width either

UCS-2 characters are fixed at width 16 bits, UCS-4/UTF-32 are fixed at 32 bits. Is this wrong?

crazypenguin
Mar 9, 2005
nothing witty here, move along
https://mathias.gaunard.com/unicode/doc/html/unicode/introduction_to_unicode.html

no really, if you're confused by that, click that link. I know it says introduction to unicode but it isn't long and it's really just about exactly this point

MrMoo
Sep 14, 2000

Easiest with the example from ComradeCosmobot:

👨‍👨‍👦‍👦 (Man, Man, Boy, Boy) is a single user-visible character composed of four characters joined together, seven code points in all:

Codepoint #1 = Man 👨 = U+1F468
Codepoint #2 = Zero-width joiner = U+200D
Codepoint #3 = Man 👨 = U+1F468
Codepoint #4 = Zero-width joiner = U+200D
Codepoint #5 = Boy 👦 = U+1F466
Codepoint #6 = Zero-width joiner = U+200D
Codepoint #7 = Boy 👦 = U+1F466

Where U+<hex> means the raw Unicode code point, which you can usually reference in a language literal, e.g. "\U0001F468" in Python.

The raw bytes on macOS look like this, 26 bytes long (25 for the sequence plus a trailing newline):

$ cat unicode.txt
👨‍👨‍👦‍👦
$ hexdump unicode.txt
0000000 f0 9f 91 a8 e2 80 8d f0 9f 91 a8 e2 80 8d f0 9f
0000010 91 a6 e2 80 8d f0 9f 91 a6 0a
000001a


i.e. Unicode is quite complicated, and characters can even take on multiple meanings due to special modifying codepoints.

MrMoo fucked around with this message at 03:32 on Nov 26, 2017
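The breakdown above can be checked from Python 3, where len() counts code points (a sketch using the same family emoji):

```python
# The family emoji: Man, ZWJ, Man, ZWJ, Boy, ZWJ, Boy.
family = "\U0001F468\u200D\U0001F468\u200D\U0001F466\u200D\U0001F466"

print(len(family))                       # 7  code points
print(len(family.encode("utf-8")))       # 25 bytes (matches the hexdump, minus the newline)
print(len(family.encode("utf-16-le")))   # 22 bytes: 11 16-bit units (surrogate pairs for the emoji)
print(len(family.encode("utf-32-le")))   # 28 bytes: 7 * 4, one unit per code point
```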

Doom Mathematic
Sep 2, 2008
I don't know what that is meant to be an example of.

pseudorandom name
May 6, 2007

Doom Mathematic posted:

I don't know what that is meant to be an example of.

UTF-32 isn’t a fixed width encoding.

ComradeCosmobot
Dec 4, 2004

USPOL July

MrMoo posted:

Easiest with the example from ComradeCosmobot:

👨‍👨‍👦‍👦 (Man, Man, Boy, Boy) is a single user-visible character composed of four characters joined together, seven code points in all:

Codepoint #1 = Man 👨 = U+1F468
Codepoint #2 = Zero-width joiner = U+200D
Codepoint #3 = Man 👨 = U+1F468
Codepoint #4 = Zero-width joiner = U+200D
Codepoint #5 = Boy 👦 = U+1F466
Codepoint #6 = Zero-width joiner = U+200D
Codepoint #7 = Boy 👦 = U+1F466

Where U+<hex> means the raw Unicode code point, which you can usually reference in a language literal, e.g. "\U0001F468" in Python.

The raw bytes on macOS look like this, 26 bytes long (25 for the sequence plus a trailing newline):

$ cat unicode.txt
👨‍👨‍👦‍👦
$ hexdump unicode.txt
0000000 f0 9f 91 a8 e2 80 8d f0 9f 91 a8 e2 80 8d f0 9f
0000010 91 a6 e2 80 8d f0 9f 91 a6 0a
000001a


i.e. Unicode is quite complicated, and characters can even take on multiple meanings due to special modifying codepoints.

the best part about that example is that i lied: the example i gave technically has no combining characters and, depending on which emoji zwj sequences your standard library is aware of (if any; they technically aren’t required for a compliant implementation!), it might count as 7 characters or as 1 character

(a true combining character would involve a codepoint that has a non-zero combining class)

ComradeCosmobot fucked around with this message at 04:13 on Nov 26, 2017

redleader
Aug 18, 2005

Engage according to operational parameters

pseudorandom name posted:

UTF-32 isn’t a fixed width encoding.

4 bytes is enough to represent every code point that has been and can be defined by unicode as it currently stands. what am i missing?

pokeyman
Nov 26, 2006

That elephant ate my entire platoon.

redleader posted:

4 bytes is enough to represent every code point that has been and can be defined by unicode as it currently stands. what am i missing?

they’re saying it’s not a fixed-width encoding of characters. you’re saying it’s a fixed-width encoding of code points

I’m not knowledgeable enough to comment on whether this is too pedantic a distinction to make in this context (are the various encoding standards defined by code point or by character and does it matter?), but I can comfortably say that if you’re being too pedantic in a discussion about loving Unicode then maybe it’s time to take a good look in the mirror and ask yourself who you wanna be in this world

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003

redleader posted:

4 bytes is enough to represent every code point that has been and can be defined by unicode as it currently stands. what am i missing?

characters can be made up of multiple code points combined (see the post above yours), so just because you can jump up 4 bytes to the next code point doesn't mean you've advanced one character, which is probably what your user wanted to do with the code you wrote, since that's what she saw on the screen

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe
the point is that while you can O(1) index to a code point in those encodings, there is no actual semantic operation on strings that takes advantage of that because you have to be aware of the possibility that you've indexed into the middle of a multi-code-point character. you could scan backwards to see if you are but you can do that with utf-8 and utf-16, too

there are simpler examples even in ucs-2, like the combining hangul or (more familiar to westerners) the combining diacritics
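rjmccall's point in a Python 3 sketch, using the combining acute accent: O(1) code point indexing can land you in the middle of a user-visible character, and `unicodedata.combining()` is the "scan backwards" test for it.

```python
import unicodedata

s = "e\u0301"   # renders as 'é': 'e' plus a combining acute accent
print(len(s))   # 2 code points, but 1 user-visible character

# Indexing lands you on the bare accent, mid-"character":
mark = s[1]
assert unicodedata.combining(mark) != 0   # non-zero combining class

# So even with O(1) indexing, you must scan backwards past any
# combining marks to find the start of the character you're in.
assert unicodedata.combining(s[0]) == 0   # 'e' starts a character
```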

pseudorandom name
May 6, 2007

there's also fun things like the fi ligature which is a single code point, but in a correctly implemented Unicode text entry field will behave as two characters with respect to Backspace/Delete or cursor movement

Malcolm XML
Aug 8, 2009

I always knew it would end like this.
normalized forms oooo

gonadic io
Feb 16, 2011

>>=

pseudorandom name posted:

there's also fun things like the fi ligature which is a single code point, but in a correctly implemented Unicode text entry field will behave as two characters with respect to Backspace/Delete or cursor movement

I really like this kind of behaviour. I have my vs code set up with ligatures that are still the ascii chars in the file, and semantically when editing, but visually they become the unicode ligatures

Carthag Tuek
Oct 15, 2005

Tider skal komme,
tider skal henrulle,
slægt skal følge slægters gang



gonadic io posted:

I really like this kind of behaviour. I have my vs code set up with ligatures that are still the ascii chars in the file, and semantically when editing, but visually they become the unicode ligatures

hows it do if you copy to another application? just wondering whats going on in the frontend

cinci zoo sniper
Mar 15, 2013

Powaqoatse posted:

hows it do if you copy to another application? just wondering whats going on in the frontend

sounds like he is just using fira code or equivalent

gonadic io
Feb 16, 2011

>>=

cinci zoo sniper posted:

sounds like he is just using fira code or equivalent

That's it yeah. It's purely a visual thing, all selection and copy pasting and editing behave as if they were separate characters

cinci zoo sniper
Mar 15, 2013

yea i use it too. the www ligature is extremely annoying, but when it finally gets to me ill just look up that russian fira code clone i saw the other day that has a bunch of customisation options

Workaday Wizard
Oct 23, 2009

by Pragmatica

cinci zoo sniper posted:

yea i use it too. the www ligature is extremely annoying, but when it finally gets to me ill just look up that russian fira code clone i saw the other day that has a bunch of customisation options

i couldve sworn the main fira site contains variants without the www ligature

Carthag Tuek
Oct 15, 2005

Tider skal komme,
tider skal henrulle,
slægt skal følge slægters gang



oh neat

Carthag Tuek
Oct 15, 2005

Tider skal komme,
tider skal henrulle,
slægt skal følge slægters gang



also you can open the ttf/otf files in like fontforge and gently caress with the feature config to disable specific ligatures if you want

MononcQc
May 29, 2007

Here's an attempt at a quick run-through of vocabulary and consequences. In ASCII and latin-1 (ISO-8859-N) you mostly can deal with the following words, which are all synonymous:
  • character
  • letter
  • symbol
In some variants, you also have to add the word 'diacritic' or 'accent', which describes a mark that modifies a character linguistically:

a + ` = à

Fortunately, in latin-1, most of these naughty diacritics have been bundled into precomposed characters. In French, for example, this final 'letter' can be represented by a single code (224).

There are, however, complications coming from that. One of them is 'collation' (the sorting order of letters). For example, 'a' and 'à' in French ought to sort in the same portion of the alphabet (before 'b'), but by a naive byte comparison 'à' ends up sorting after 'z'.

In Danish, A is the first letter of the alphabet, but Å is the last. Å is also seen as a ligature of Aa, so 'Aa' sorts like Å rather than as the two letters 'A' and 'a' one after the other. Swedish has different diacritics with a different order: Å, Ä, Ö.

Enter UNICODE. To make a long story short (and with gross oversimplification), we have the following terms in the vocabulary:
  • character: smallest representable unit in a language, in the abstract. '`' is a character, so is 'a', and so is 'à'
  • glyph: the visual representation of a character. Think of it as a character from the point of view of the font or typeface designer. For example, the same glyph may be used for the capital letter 'pi' and the mathematical symbol for a product: ∏. Similarly, capital 'Sigma' and the mathematical 'sum' may have different character representation, but the same ∑ glyph.
  • letter: an element of an alphabet
  • codepoint: a given value in the unicode space. There's a big table with a crapload of characters in it, and every character is assigned a codepoint as a unique identifier.
  • code unit: a specific encoded form of a given code point. This refers to bits, not just the big table. The same code point may map to different code units in UTF-8, UTF-16, and UTF-32, which are 3 'encodings' of unicode.
  • grapheme: what the user thinks of a 'character'
  • grapheme cluster: what you want to think of as a 'character' for your user's sake. Basically, 'a' and '`' can be two graphemes, but if I combine them together as 'à', I want to be able to say that a single 'delete' key press will remove both the '`' and the 'a' at once from my text, and not be left with one or the other.
The big fun bit is that unicode takes all these really lovely complicated linguistic things and specifies how they should be handled.

Like, what makes two strings equal? The french 'é' can be represented both as a single é or as e + ´. It would be good, when you deal with say JSON or maybe my own name, that you don't end up having 'Frédéric' as 4 different people depending on which form is used. In any case, these equivalence rules are specified in the normalization forms (http://unicode.org/reports/tr15/), and there's a bunch of similar reports for the other ambiguities. If you deal with string length, you can have it in bytes, in code points, in code units, or in grapheme clusters.
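The é example, run through Python's unicodedata module (a sketch of TR15's NFC/NFD forms):

```python
import unicodedata

composed = "\u00e9"      # 'é' as one precomposed code point
decomposed = "e\u0301"   # 'e' plus combining acute accent

# They render identically, but compare unequal as raw code points:
print(composed == decomposed)   # False

# Normalizing both sides to the same form fixes the comparison:
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# This is what keeps 'Frédéric' from becoming 4 different people.
```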

As someone has mentioned, a unicode family emoji (as a single grapheme) can be built of 4 emoji people joined together, and each of these emoji people may require multiple bytes to be represented. But under a proper unicode library, they can be consumed as one user-readable entity (grapheme cluster) transparently.

It's a good thing to make a dedicated interface for unicode strings, because a dedicated type really does handle strings better. The real horror is having treated text as rando bytes for years without ever giving it the proper abstraction it deserves. Strings should probably not have been this PL-side 'blob' data type that could handle byte streams and text and all of that at once. You could do it easily under ASCII and latin-1, and it made sense at a lower level, but it turns out that doesn't carry over to real-world languages.

IMO python3 suffered a whole lot because, as far as I understand: 1. they took away byte strings at first, so everyone would be forced to use unicode even when it made no sense (they just flipped the bad decision around); 2. it took a long while to bring them back; 3. everyone had to learn the conceptual difference between byte strings and unicode at once.
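The conceptual split in point 3 is visible right at the REPL; a sketch of how Python 3 forces the distinction:

```python
# bytes and str are disjoint types in Python 3: mixing them is an
# error, rather than an implicit coercion as in Python 2.
try:
    b"abc" + "abc"
except TypeError as exc:
    print("refused:", exc)

# Crossing the boundary is always an explicit encode/decode:
data = "Frédéric".encode("utf-8")   # str -> bytes
assert data.decode("utf-8") == "Frédéric"
```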

mystes
May 31, 2006

The people who were happy with python 2 are English speakers who were blithely ignoring Unicode issues and screwing things up for people using other languages. Its defaults are not acceptable in 2017.

On the other hand, dynamic typing does seem to make this more confusing than necessary for people who really do want to deal with bytes.

Notorious b.s.d.
Jan 25, 2003

by Reene

MononcQc posted:

It's a good thing to make a dedicated interface for unicode strings, because a dedicated type really does handle strings better. The real horror is having treated text as rando bytes for years without ever giving it the proper abstraction it deserves. Strings should probably not have been this PL-side 'blob' data type that could handle byte streams and text and all of that at once. You could do it easily under ASCII and latin-1, and it made sense at a lower level, but it turns out that doesn't carry over to real-world languages.

importantly, filenames on every extant platform are bags of bytes, not fully-fledged unicode strings

if you are going to handle unicode correctly, you have to have clear interfaces to produce and consume those opaque bags of bytes, or else you will fail

naturally, english-speaking programmers really hate all this extra "cruft" and beg for their old dumb C strings
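For reference, the escape hatch Python 3 eventually settled on for those opaque bags of bytes is the surrogateescape error handler (also used by os.fsencode()/os.fsdecode() on POSIX); a sketch with a hypothetical mangled filename:

```python
# A filename containing a byte that is not valid UTF-8:
hosed = b"report-\xff.txt"

# surrogateescape smuggles the undecodable byte through str as a
# lone surrogate instead of raising UnicodeDecodeError...
as_str = hosed.decode("utf-8", "surrogateescape")
print(repr(as_str))   # 'report-\udcff.txt'

# ...and the original bytes round-trip exactly on the way back out.
assert as_str.encode("utf-8", "surrogateescape") == hosed
```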


MononcQc posted:

IMO python3 suffered a whole lot because, as far as I understand: 1. they took away byte strings at first, so everyone would be forced to use unicode even when it made no sense (they just flipped the bad decision around); 2. it took a long while to bring them back; 3. everyone had to learn the conceptual difference between byte strings and unicode at once.

worst of all, python 3 had almost nothing to offer users

as a person who just wants to bang out a quick data processor in numpy, there was 0 reason to use python 3 over 2

MrMoo
Sep 14, 2000

Notorious b.s.d. posted:

naturally, english-speaking programmers really hate all this extra "cruft" and beg for their old dumb C strings

Also, Chinese-speaking programmers hate the overheads of Unicode compared to random locale encoding.

What is amazing is that Traditional Chinese and Simplified Chinese are two separate families of locale encodings that are incompatible and carry no signatures. Taiwan and Hong Kong are not isolated masses of people, especially the latter: their primary income is business with China. Foxconn is a Taiwanese company manufacturing tech equipment in China, directly in the IT world. Tencent, the $500 billion tech company, still updates its email software Foxmail, which only supports ASCII and some lovely subset of GB18030 that is Simplified-only. Politics and arrogance are the norm, so where there are US programmers loving ASCII, there are more Chinese programmers who think everyone should kowtow to GB18030.

Malcolm XML
Aug 8, 2009

I always knew it would end like this.

MrMoo posted:

Also, Chinese-speaking programmers hate the overheads of Unicode compared to random locale encoding.

What is amazing is that Traditional Chinese and Simplified Chinese are two separate families of locale encodings that are incompatible and carry no signatures. Taiwan and Hong Kong are not isolated masses of people, especially the latter: their primary income is business with China. Foxconn is a Taiwanese company manufacturing tech equipment in China, directly in the IT world. Tencent, the $500 billion tech company, still updates its email software Foxmail, which only supports ASCII and some lovely subset of GB18030 that is Simplified-only. Politics and arrogance are the norm, so where there are US programmers loving ASCII, there are more Chinese programmers who think everyone should kowtow to GB18030.

Also Unicode is basically institutionally racist, demanding Han unification but not the equivalent for western languages. This is because they desperately stuck to the idea that 16 bits would be enough for everything and refused to change despite now having 32 bits, and the Unicode committees are run by idiot white western nerds.


That said, their Indic script support is surprisingly ok (not great, but ok), because the big pain in those scripts comes from the typesetting engines, owing to the prevalence of ligatures

tef
May 30, 2004

-> some l-system crap ->

Suspicious Dish posted:

as someone that does a lot of reverse engineering, network protocols, and in a lot of cases, deals with completely hosed filenames, python3 is a lot worse of a language for me.

tbh python2 was pretty lovely at it, but eh, python3 really hosed those use cases

i found bytearrays fixed a bunch of stuff, and at least they put bytestring formatting back in
