Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
Yes, Python 3's ideas about Unicode are terrible and do not work out in the real world.

For a while, I've wanted to fork Python 2.7 and add all the stuff from 3.3+ and release a Python 2.9 (GvR himself said there will never be a Python 2.8, so...)

Zombywuf
Mar 29, 2008

Strange, those same ideas seem to work perfectly in .Net.

SupSuper
Apr 8, 2009

At the Heart of the city is an Alien horror, so vile and so powerful that not even death can claim it.

MrMoo posted:

The entire Unicode system is a minefield, consider that Python works on both Windows and OSX and both of those take different interpretations of Unicode coding patterns.
The usual "solution" is to always use your own Unicode representation and only convert when dealing with OS APIs.

But really i18n in programming is an absolute joke. Despite the Unicode standard having been around for years and years now, every single platform/language/library/etc handles it differently with their own problems and quirks that you need to cope with. Encodings are still a headache. And trying to "roll-your-own" is on par with date libraries, with an insane amount of complex rules and algorithms to deal with. I don't envy anyone that has to go down that path.

Zombywuf
Mar 29, 2008

SupSuper posted:

But really i18n in programming is an absolute joke.

This is primarily because of the masses of programmers who are still at the level where they believe they are being unfairly put upon when asked to support any character encoding other than the one they are currently using.

You can barely get programmers to understand context-free grammars, why would you expect them to understand pluralisation in human languages?

Lysidas
Jul 26, 2002

John Diefenbaker is a madman who thinks he's John Diefenbaker.
Pillbug

Suspicious Dish posted:

Yes, Python 3's ideas about Unicode are terrible and do not work out in the real world.

Funny, I feel the exact same way about Python 2's Unicode model. What exactly do you think is wrong with using Unicode code points for everything text, and byte strings for everything binary? Inadequacies in the codec machinery and the lack of a bytes.format method are the common points that I see raised. The former is definitely valid and is on the radar of the Python core devs; I think the latter is debatable (hence the lengthy debate at http://bugs.python.org/issue3982).

Armin Ronacher posted:

Confusing enough that Python 3 took the absolutely insanely radical step and removed .decode() from Unicode strings and .encode() from byte strings and caused me major frustration. In my mind this was an insanely stupid decision but I have been told more than once that my point of view is wrong and it won't be changed back.

He definitely lost me about being mad that str.decode and bytes.encode are gone -- these are horrible side effects of Python 2's "do what you think I mean unless there's a high byte somewhere" model and have no place in a sane language.
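For anyone who hasn't hit this, here is a minimal Python 3 sketch of the split being described (the sample string is made up, not from Armin's post): str only encodes, bytes only decodes, and mixing the two fails loudly instead of guessing.

```python
text = 'h\u00e9llo'            # str: a sequence of Unicode code points
data = text.encode('utf-8')    # bytes: the UTF-8 encoding of that text

# In Python 3 each conversion only goes one way.
assert not hasattr(str, 'decode')
assert not hasattr(bytes, 'encode')

# Round trip through an explicitly named codec.
assert data == b'h\xc3\xa9llo'
assert data.decode('utf-8') == text

# Mixing text and bytes raises instead of implicitly going through ASCII.
try:
    text + data
except TypeError as exc:
    mixed_error = exc
else:
    mixed_error = None
```

In Python 2 the equivalent concatenation would attempt an implicit ASCII decode and only blow up once a high byte was present, which is the behaviour being complained about above.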

I want to plug Nick Coghlan's Python 3 Q & A again; I found it to be quite a good elevator pitch about why Python 2's Unicode model is broken:
https://ncoghlan_devs-python-notes.readthedocs.org/en/latest/python3/questions_and_answers.html

EDIT: Apparently the url BBcode tag can't handle underscores in URLs, either when manually entered or automatically added.

Zombywuf
Mar 29, 2008

Lysidas posted:

Funny, I feel the exact same way about Python 2's Unicode model. What exactly do you think is wrong with using Unicode code points for everything text, and byte strings for everything binary? Inadequacies in the codec machinery and the lack of a bytes.format method are the common points that I see raised. The former is definitely valid and is on the radar of the Python core devs; I think the latter is debatable (hence the lengthy debate at http://bugs.python.org/issue3982).

That debate is an artefact of the fact that Python fell into the everything's-a-method trap. Format should be an external method.

Also, I wonder what the correct result should be for bytes.format with a format string of "%i" when your locale is Arabic.

Arcsech
Aug 5, 2008
In case anybody wants to see a big batch of coding horrors all at once, the 2013 International Obfuscated C Code Contest winners have been announced and the source code released.

I think my favorite is the 8086 emulator in 4043 bytes of code, including:

quote:

- Intel 8086/186 CPU
- 1MB RAM
- 8072A 3.5" floppy disk controller (1.44MB/720KB)
- Fixed disk controller (supports a single hard drive up to 528MB)
- Hercules graphics card with 720x348 2-color graphics (64KB video RAM), and CGA 80x25 16-color text mode support
- 8253 programmable interval timer (PIT)
- 8259 programmable interrupt controller (PIC)
- 8042 keyboard controller with 83-key XT-style keyboard
- MC146818 real-time clock
- PC speaker

And this (in the code):
code:
--64[T=1[O=32[L=(X=*Y&7)&1,o=X/2&1,l]=0,t=(c=y)&7,a=c/8&7,Y]>>6,g=~-T?y:(n)y,d=BX=y,l]

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

Lysidas posted:

Funny, I feel the exact same way about Python 2's Unicode model. What exactly do you think is wrong with using Unicode code points for everything text, and byte strings for everything binary? Inadequacies in the codec machinery and the lack of a bytes.format method are the common points that I see raised. The former is definitely valid and is on the radar of the Python core devs; I think the latter is debatable (hence the lengthy debate at http://bugs.python.org/issue3982).

Lots of things in the real world are not Unicode. A simple example is filenames on disk. On UNIX, these are bytes, and will always be bytes. Command line arguments to Python are decoded using the system locale's encoding, which means that in order to get back a byte path, I have to figure out the environment's encoding to get the bytes that were originally written in. There isn't any API for this outside of the locale module.

Lysidas
Jul 26, 2002

John Diefenbaker is a madman who thinks he's John Diefenbaker.
Pillbug

Suspicious Dish posted:

Lots of things in the real world are not Unicode. A simple example is filenames on disk. On UNIX, these are bytes, and will always be bytes. Command line arguments to Python are decoded using the system locale's encoding, which means that in order to get back a byte path, I have to figure out the environment's encoding to get the bytes that were originally written in. There isn't any API for this outside of the locale module.

Sure there is. os.fsencode and os.fsdecode. These use the system default encoding and the surrogateescape error handler so strange-looking byte sequences make it through unscathed:

Python code:
>>> a_bad_filename = b'Some cp1252: \xe9, some UTF-8: \xc3\xa9, some Shift-JIS: \x82\xe9\x82\xc7'
>>> from os import fsdecode, fsencode
>>> decoded = fsdecode(a_bad_filename)
>>> decoded
'Some cp1252: \udce9, some UTF-8: é, some Shift-JIS: \udc82\udce9\udc82\udcc7'
>>> a_bad_filename.decode('shift-jis', errors='surrogateescape')
'Some cp1252: \udce9, some UTF-8: テゥ, some Shift-JIS: るど'
>>> fsencode(decoded + '.txt')
b'Some cp1252: \xe9, some UTF-8: \xc3\xa9, some Shift-JIS: \x82\xe9\x82\xc7.txt'
EDIT: welp, that kind of got mangled by vBulletin I guess; FYI the HTML character references aren't in my terminal output

Lysidas fucked around with this message at 23:52 on Jan 5, 2014

Zombywuf
Mar 29, 2008

Suspicious Dish posted:

Lots of things in the real world are not Unicode. A simple example is filenames on disk. On UNIX, these are bytes, and will always be bytes. Command line arguments to Python are decoded using the system locale's encoding, which means that in order to get back a byte path, I have to figure out the environment's encoding to get the bytes that were originally written in. There isn't any API for this outside of the locale module.

If someone creates files on their hdd that are not in their system wide locale then the rest of us should not be expected to pay for their mistake.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

Lysidas posted:

Sure there is. os.fsencode and os.fsdecode. These use the system default encoding and the surrogateescape error handler so strange-looking byte sequences make it through unscathed:

But that doesn't help me if I'm supposed to remotely manage a system with another encoding.

Is there no way to pass a stream of bytes into a Python interpreter as a command line argument without it being turned into Unicode somehow?

I appreciate the idealism of the Python core developers, but it doesn't help me write scripts that work on lovely legacy systems.

Lysidas
Jul 26, 2002

John Diefenbaker is a madman who thinks he's John Diefenbaker.
Pillbug

Suspicious Dish posted:

But that doesn't help me if I'm supposed to remotely manage a system with another encoding.

Is there no way to pass a stream of bytes into a Python interpreter as a command line argument without it being turned into Unicode somehow?

I appreciate the idealism of the Python core developers, but it doesn't help me write scripts that work on lovely legacy systems.

I'm not quite sure what you mean -- it seems like there are two sides to "remotely manage a system with a different encoding". If you have local access to the remote filenames (in an unknown encoding), either treat them as bytes or let them pass through the os.fs{encode, decode} methods unchanged. You can os.fsencode them if you're using them as a command-line argument for something like ssh somewhere python script.py arg1 arg2.

On the other side, elements of sys.argv are str objects produced by os.fsdecode, and if you want the actual bytes back just immediately call os.fsencode.

test.py:
Python code:
#!/usr/bin/env python3
from os import fsencode
import sys

# Show each argument as Python decoded it, then the original bytes.
for arg in sys.argv:
    print('raw: {!r}'.format(arg))
    print('fsencoded: {!r}'.format(fsencode(arg)))
call.py:
Python code:
#!/usr/bin/env python3
from subprocess import Popen

# b'h\xe9llo' is not valid UTF-8; bytes arguments are passed through as-is.
command = [b'./test.py', b'h\xe9llo']
Popen(command).wait()
code:
$ ./call.py
raw: './test.py'
fsencoded: b'./test.py'
raw: 'h\udce9llo'
fsencoded: b'h\xe9llo'

Plorkyeran
Mar 22, 2007

To Escape The Shackles Of The Old Forums, We Must Reject The Tribal Negativity He Endorsed
Anyone with filenames on a unix-like system that are not utf-8 strings deserves whatever they get. I would be much more inclined to go out of my way to break such a thing than to support it.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
I'd really love that too, but we have to support SHIFT-JIS systems.

Strong Sauce
Jul 2, 2003

You know I am not really your father.





Amberskin posted:

Related to the discussion we had a couple of weeks ago about how successful a decompiler can be, this appeared on /. today:

http://valverde.me/2014/01/03/reverse-engineering-my-bank's-security-token/#.UslsCXnGyvJ

tl;dr: some Brazilian guy didn't like the Android app his bank provides to get OTP codes, so he decompiled it and re-implemented the whole thing to run on an Arduino-compatible board showing the codes on a small LCD. :bravo:

How is this a horror?

Edit: Google Cache:

http://webcache.googleusercontent.c...n&ct=clnk&gl=us

Strong Sauce fucked around with this message at 00:36 on Jan 6, 2014

Pollyanna
Mar 5, 2005

Milk's on them.


Shift-JIS is the bane of my existence. :(

Zombywuf
Mar 29, 2008

ITYM: poo poo Jizz.

HORATIO HORNBLOWER
Sep 21, 2002

no ambition,
no talent,
no chance

DSauer posted:

It's always annoying that Windows solely uses UTF-16 for anything that handles text in Win32. WideCharToMultiByte and MultiByteToWideChar work flawlessly for doing conversions to and from UTF-16, but it's still a pain in the butt to not just be able to say, "All our text is UTF-8" and throw everything into char strings.

It's worth noting that Java also uses UTF-16 internally and both platforms chose that encoding because the (misguided) thinking at the time was that it would be enough to represent any Unicode code point, and you would be able to just say "All our text is UTF-16." Of course it didn't work out that way. I think UTF-8 has its own problems since the considerable overlap between it and ASCII/ISO-8859/Windows-1252 lets incautious programmers think they're supporting Unicode when really they just aren't testing any unusual inputs.
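A quick Python 3 sketch of that last point (the sample strings are made up): pure-ASCII test data is byte-identical under all of these encodings, so a wrong codec only shows up once something like é goes through.

```python
ascii_only = 'hello'
accented = 'h\u00e9llo'  # contains U+00E9

# ASCII-only text encodes to the same bytes in every one of these codecs,
# so tests that only use it cannot tell the codecs apart.
encodings = ('ascii', 'utf-8', 'latin-1', 'cp1252')
assert len({ascii_only.encode(enc) for enc in encodings}) == 1

# With a single non-ASCII character the encodings diverge...
utf8_bytes = accented.encode('utf-8')      # b'h\xc3\xa9llo'
cp1252_bytes = accented.encode('cp1252')   # b'h\xe9llo'
assert utf8_bytes != cp1252_bytes

# ...and decoding UTF-8 bytes with the wrong codec raises no error at all,
# it just quietly produces mojibake.
mojibake = utf8_bytes.decode('cp1252')
print(mojibake)
```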

Zombywuf
Mar 29, 2008

HORATIO HORNBLOWER posted:

It's worth noting that Java also uses UTF-16 internally and both platforms chose that encoding because the (misguided) thinking at the time was that it would be enough to represent any Unicode code point, and you would be able to just say "All our text is UTF-16." Of course it didn't work out that way. I think UTF-8 has its own problems since the considerable overlap between it and ASCII/ISO-8859/Windows-1252 lets incautious programmers think they're supporting Unicode when really they just aren't testing any unusual inputs.

I think you're confusing UCS-2 and UTF-16.
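For reference, the difference shows up as soon as a code point falls outside the Basic Multilingual Plane. UCS-2 is a fixed two bytes per code point and cannot represent anything above U+FFFF, while UTF-16 spends two code units (a surrogate pair) on those. A small Python 3 sketch, with code points picked arbitrarily:

```python
bmp = '\u00e9'          # U+00E9, inside the BMP
astral = '\U0001F600'   # U+1F600, outside the BMP

# One 16-bit code unit is enough for a BMP code point.
assert len(bmp.encode('utf-16-le')) == 2

# Outside the BMP, UTF-16 needs a surrogate pair: two 16-bit code units.
units = astral.encode('utf-16-le')
assert len(units) == 4
assert units == b'\x3d\xd8\x00\xde'   # 0xD83D 0xDE00, little-endian

# Python 3.3+ counts code points, so len() is 1 here; anything that
# indexes by UTF-16 code unit sees two units for the same character.
assert len(astral) == 1
utf16_code_units = len(units) // 2
```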

MrMoo
Sep 14, 2000

Pollyanna posted:

Shift-JIS is the bane of my existence. :(

SJIS + EUC_jp are ok as many tools support them; it's all the other Japanese standards that are really annoying. Simplified Chinese encoding leaves a lot to be desired too.

Plorkyeran
Mar 22, 2007

To Escape The Shackles Of The Old Forums, We Must Reject The Tribal Negativity He Endorsed

Suspicious Dish posted:

I'd really love that too, but we have to support SHIFT-JIS systems.
Have you considered instead blowing up the shift-jis systems and shooting everyone responsible?

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

Plorkyeran posted:

Have you considered instead blowing up the shift-jis systems and shooting everyone responsible?

God yes.

Pilsner
Nov 23, 2002

Arcsech posted:

In case anybody wants to see a big batch of coding horrors all at once, the 2013 International Obfuscated C Code Contest winners have been announced and the source code released.

I think my favorite is the 8086 emulator in 4043 bytes of code, including:


And this (in the code):
code:
--64[T=1[O=32[L=(X=*Y&7)&1,o=X/2&1,l]=0,t=(c=y)&7,a=c/8&7,Y]>>6,g=~-T?y:(n)y,d=BX=y,l]
Here's the full document on it:

http://www.ioccc.org/2013/cable3/hint.html

gently caress. Just incredible stuff. I wouldn't even know where to begin.

Sinestro
Oct 31, 2010

The perfect day needs the perfect set of wheels.
code:
int16_t byte;
:stonklol:

fritz
Jul 26, 2003

Plorkyeran posted:

Anyone with filenames on a unix-like system that are not utf-8 strings deserves whatever they get. I would be much more inclined to go out of my way to break such a thing than to support it.

The whole point of all this suffering with encodings is because we don't live in an 'anybody who does a thing we don't like with text can suck it' world.

Zombywuf
Mar 29, 2008

fritz posted:

The whole point of all this suffering with encodings is because we don't live in an 'anybody who does a thing we don't like with text can suck it' world.

There are two ways of dealing with people who insist on doing things outside of the realm of common sense (using multiple encodings in a single string in defiance of the single piece of metadata describing the string encoding is about as bad as it gets in this regard), one is to charge them through the nose for fixing their mess and the other is telling them to gently caress off.

If the "thing we don't like with text" is actively preventing people from developing software usable by people outside a small group who speak en_US then "suck it" is a pretty mild response.

tef
May 30, 2004

-> some l-system crap ->

Plorkyeran posted:

Anyone with filenames on a unix-like system that are not utf-8 strings deserves whatever they get. I would be much more inclined to go out of my way to break such a thing than to support it.

Also known as "gently caress You, Got My Encoding Sorted"


Zombywuf posted:

There are two ways of dealing with people who insist on doing things outside of the realm of common sense (using multiple encodings in a single string in defiance of the single piece of metadata describing the string encoding is about as bad as it gets in this regard), one is to charge them through the nose for fixing their mess and the other is telling them to gently caress off.

Weirdly enough treating these filenames as opaque series of bytes tends to work quite well, rather than assuming everyone uses your one true string encoding.


quote:

If the "thing we don't like with text" is actively preventing people from developing software usable by people outside a small group who speak en_US then "suck it" is a pretty mild response.

(It's more common on systems in Japan)

tef
May 30, 2004

-> some l-system crap ->

Lysidas posted:

On the other side, elements of sys.argv are str objects produced by os.fsdecode, and if you want the actual bytes back just immediately call os.fsencode.

I wasn't aware of os.fsdecode, this seems nice, although a little bit weird. I am unnerved by not having access to the raw bytes and hoping that the decoding process is reversible (is this guaranteed for all settings?)

Edit: And the manual says it uses 'strict' on windows, and 'surrogateescape' on unix, so I'm assuming this means your trick will only work on unix.


Lysidas posted:

Funny, I feel the exact same way about Python 2's Unicode model. What exactly do you think is wrong with using Unicode code points for everything text, and byte strings for everything binary?

Nothing, just that python2's binary support was a bit nicer for handling internet protocols. (Also PEP 3333)

tef fucked around with this message at 12:19 on Jan 6, 2014

tef
May 30, 2004

-> some l-system crap ->
Ooh look http://python-notes.curiousefficiency.org/en/latest/python3/binary_protocols.html#binary-protocols

Zombywuf
Mar 29, 2008

tef posted:

(It's more common on systems in Japan)

Japan can use their own Python (i.e. Ruby) and stop loving it up for the rest of us.

Lysidas
Jul 26, 2002

John Diefenbaker is a madman who thinks he's John Diefenbaker.
Pillbug

tef posted:

I wasn't aware of os.fsdecode, this seems nice, although a little bit weird. I am unnerved by not having access to the raw bytes and hoping that the decoding process is reversible (is this guaranteed for all settings?)

Edit: And the manual says it uses 'strict' on windows, and 'surrogateescape' on unix, so I'm assuming this means your trick will only work on unix.

Yes, it's guaranteed to return the original bytes.
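That guarantee comes from the surrogateescape error handler (PEP 383): every byte that can't be decoded is mapped to a lone surrogate in U+DC80..U+DCFF and mapped back on encode. A rough check follows, with the codec passed explicitly since os.fsdecode picks its codec up from the environment. It only exercises ASCII-compatible codecs, so treat anything else as unverified:

```python
import os

every_byte = bytes(range(256))
random_junk = os.urandom(1024)

for codec in ('utf-8', 'ascii', 'latin-1'):
    for raw in (every_byte, random_junk):
        decoded = raw.decode(codec, errors='surrogateescape')
        # Encoding with the same codec and handler restores the input exactly.
        assert decoded.encode(codec, errors='surrogateescape') == raw

# The escaped bytes are lone surrogates, so a strict encode refuses them.
escaped = b'h\xe9llo'.decode('utf-8', errors='surrogateescape')
assert escaped == 'h\udce9llo'
try:
    escaped.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The strict failure at the end is why these strings are safe to hand back to OS interfaces but will raise if you try to write them to a UTF-8 text file or socket without the same error handler.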

You're right that this only works on Unix. I was too hasty and shouldn't have suggested manually using fsdecode and fsencode unless you really know that you need to. You probably don't. Just omit it, since it'll automatically be used in OS interfaces that need bytes for filenames or command arguments. Use str objects everywhere -- this will also have the nice side effect of working correctly on platforms where filenames are natively stored as Unicode code points.

Python code:
>>> raw_filename = b'h\xe9llo.txt'
>>> contents = "here's some text I'm going to put in this file"
>>> with open(raw_filename, 'w') as f:
...   print(contents, file=f)
... 
>>> from os import fsdecode
>>> decoded = fsdecode(raw_filename)
>>> decoded
'h\udce9llo.txt'
>>> with open(decoded) as f:  # str automatically fsencoded
...   print(repr(f.read()))
... 
"here's some text I'm going to put in this file\n"
If you're producing the decoded filenames (from sys.argv or os.listdir/walk/whatever) and using them on the same system, just always use the str versions and don't worry about encoding anything yourself.

One of the few use cases that I can think of for manual fsencode is what Suspicious Dish said: remote management of a system with a different native encoding. I'm not even sure about this anymore, though. I think everything will work correctly even if you do something like the following:
  • Local system encoding: UTF-8
  • Remote system encoding: Shift-JIS
  • Remote filesystem mounted locally over NFS, with filenames containing Shift-JIS-encoded Japanese characters
  1. Call os.listdir('/remote/path') locally (note the str argument, which will cause filenames to be run through os.fsdecode). The decoded filenames will probably be full of surrogate escape code points, which will be ignored by anything like str.upper()
  2. Call a remote command with something like subprocess.Popen(['ssh', 'remote-host', command, filename_from_listdir]). Popen will fsencode its args, so that command is guaranteed to contain the actual bytes that make up the filename and this will be correct from the perspective of the command on the remote system
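A local stand-in for those two steps, as a sketch only: there is no NFS mount or ssh here, the child process is just another Python interpreter playing the part of the remote command, and the filename bytes are the Shift-JIS ones from the earlier example. Unix-only, since it leans on surrogateescape.

```python
import os
import subprocess
import sys

# Shift-JIS bytes reused from the earlier example, plus an ASCII extension.
raw_name = b'\x82\xe9\x82\xc7.txt'

# Step 1 stand-in: os.listdir(str) would hand back names decoded like this.
name = os.fsdecode(raw_name)

# Step 2 stand-in: pass the str as an argument; subprocess fsencodes it.
# The child (a hypothetical "remote command") reports the bytes it received.
child = 'import os, sys; sys.stdout.write(repr(os.fsencode(sys.argv[1])))'
output = subprocess.check_output([sys.executable, '-c', child, name])

received = output.decode('ascii')
print(received)
```

If everything round-trips as described, the printed value is the repr of the original byte string, regardless of whether the local encoding could make sense of those bytes.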

EDIT: remove sentence duplicated between paragraphs

Lysidas fucked around with this message at 16:58 on Jan 6, 2014

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

Zombywuf posted:

Japan can use their own Python (i.e. Ruby) and stop loving it up for the rest of us.

Hi shrughes!

Blotto Skorzany
Nov 7, 2008

He's a PSoC, loose and runnin'
came the whisper from each lip
And he's here to do some business with
the bad ADC on his chip
bad ADC on his chiiiiip
Red Hat is omakase,

QuarkJets
Sep 8, 2008

Zombywuf posted:

Japan can use their own Python (i.e. Ruby) and stop loving it up for the rest of us.

A programming language that is cross-platform but not cross-nationality

Plorkyeran
Mar 22, 2007

To Escape The Shackles Of The Old Forums, We Must Reject The Tribal Negativity He Endorsed
Everywhere but Japan is still cross-nationality.

Zombywuf
Mar 29, 2008

When it comes to character encoding I'm pretty sure Japan is just trolling the rest of the world now.

Pollyanna
Mar 5, 2005

Milk's on them.


That's kinda par for the course for Japan, really.

That Turkey Story
Mar 30, 2003

astr0man posted:

As someone who used to do RE for a living, I can say with certainty that it is easier to reverse C code than it is C++.

Yeah. If all C++ had over C was the addition of function overloading it would make things easier, but there's just so much complexity in C++ that in general I can't imagine it's easier overall.

Zhentar
Sep 28, 2003

Brilliant Master Genius
And that's even before the crazy compilers go mucking things up.

GCC apparently has some optimization for std::basic_string where it will pass around a pointer to the char buffer, rather than the start of the object; accessing the other member variables ends up being a negative offset rather than positive, as you'd expect. This lets it pass the string to functions taking a char* without any math.

That Turkey Story
Mar 30, 2003

Zhentar posted:

And that's even before the crazy compilers go mucking things up.

GCC apparently has some optimization for std::basic_string where it will pass around a pointer to the char buffer, rather than the start of the object; accessing the other member variables ends up being a negative offset rather than positive, as you'd expect. This lets it pass the string to functions taking a char* without any math.

libstdc++ used copy-on-write, but now it's pretty much universally accepted that the small string optimization is best in the general case.
