Ika
Dec 30, 2004
Pure insanity

Xarn posted:

Maybe it is supposed to be like NaN and be fundamentally unequal? :v:

That was the first thing my coworker said as well when I showed him the code. The struct is meant to store persisted object state. Whoever mangled it also was storing whether the UI to edit the object was open in it. (In a double variable, set to 0.0 when the UI is closed, set to 2.0 for open, never initialized, and the test was > 0 for open instead of != 0.0 ...)

Some days....


hackbunny
Jul 22, 2007

I haven't been on SA for years but the person who gave me my previous av as a joke felt guilty for doing so and decided to get me a non-shitty av

Suspicious Dish posted:

And before you say "Windows has Unicode filenames", no, they have UCS-2 filenames.

Oh it's much more complicated than that. Filenames are surprisingly complicated on all platforms

Suspicious Dish posted:

Astral plane characters in filenames are basically not supported at all in most Windows apps.

I wouldn't say that. As long as you consider the names you get back from the filesystem as opaque, everything will work. String manipulations and encoding/decoding roundtrips can easily break them though

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
My understanding was that NTFS used "wide strings" for all filenames. I didn't think it was just as simple as "opaque bytes", but if it is, that kind of makes my argument easier. :)

I'd love to hear about Windows filename insanity. My personal experience with this comes from dealing with the NTFS-3g driver on Linux and having to deal with a bizarre edge case involving astral plane character filenames.

Kazinsal
Dec 13, 2011
You want to deal with a filename nightmare? Pulling files with astral plane characters in them off a FreeBSD ZFS box via tar -> scp -> untar onto a Windows machine is the worst loving thing I've done regarding filename encoding.

You'd think because it should be some kind of Unicode on both ends, it'd work well, but no. Somehow the multibyte encodings got mangled into codepage 437, the characters of which Windows promptly directly translated to what I *think* was UCS-2, but I'm not sure (eg. the UTF-8 sequence byte 0xB6 was interpreted as CP437, and became U+2562).

There are probably a number of things that Windows will happily munge in this way.
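
A minimal Python sketch of that kind of mangling, assuming the only thing that went wrong was UTF-8 bytes being read as code page 437 one byte at a time. The sample character is an arbitrary astral plane character picked because its UTF-8 form ends in 0xB6; it is not taken from the actual files.

```python
# Sketch: UTF-8 bytes misread as code page 437, one byte per character.
original = "\U0001F636"                 # astral plane character; UTF-8 is F0 9F 98 B6
utf8_bytes = original.encode("utf-8")   # four bytes, the last one is 0xB6
mangled = utf8_bytes.decode("cp437")    # every byte becomes its own CP437 character

print(utf8_bytes.hex(" "))
for ch in mangled:
    # The final line printed is U+2562, matching the 0xB6 observation above.
    print(f"U+{ord(ch):04X}")
```

One character in, four unrelated characters out, and nothing in the result says which encoding was wrongly assumed.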

VikingofRock
Aug 24, 2008




hackbunny posted:

Oh it's much more complicated than that. Filenames are surprisingly complicated on all platforms

This sounds really interesting. For example, I thought on Linux filenames were basically just any sequence of bytes not containing nulls or '/', with the total path length kept under PATH_MAX bytes if you don't want things to start breaking. Is it significantly more complicated than that?

nielsm
Jun 1, 2009



VikingofRock posted:

This sounds really interesting. For example, I thought on Linux filenames were basically just any sequence of bytes not containing nulls or '/', with the total path length kept under PATH_MAX bytes if you don't want things to start breaking. Is it significantly more complicated than that?

If you want to display them you need to know the encoding used for the text.
If you transfer files between systems, or use removable media or foreign file systems, you need to know the remote encoding and translate for names to remain sensible to the humans.
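
A small Python sketch of the problem, assuming a filename that was written as Latin-1 bytes on a system that otherwise expects UTF-8. The name itself is made up. The `surrogateescape` error handler shown here is the one Python's `os.fsdecode` relies on for POSIX paths, as far as I know.

```python
raw_name = b"caf\xe9.txt"   # Latin-1 bytes for "café.txt"; not valid UTF-8

# Strict decoding fails: 0xE9 opens a multi-byte sequence that never completes.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape carries the undecodable byte through as a lone surrogate...
as_text = raw_name.decode("utf-8", "surrogateescape")
# ...so the original bytes can be recovered exactly.
round_trip = as_text.encode("utf-8", "surrogateescape")

# Only when the real encoding is known can the name be shown sensibly.
for_display = raw_name.decode("latin-1")

print(strict_ok, round_trip == raw_name, for_display)
```

The escaped form is fine for passing back to the filesystem but is not something to show a human, which is why the encoding still has to be known for display.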

SupSuper
Apr 8, 2009

At the Heart of the city is an Alien horror, so vile and so powerful that not even death can claim it.

Kazinsal posted:

You want to deal with a filename nightmare? Pulling files with astral plane characters in them off a FreeBSD ZFS box via tar -> scp -> untar onto a Windows machine is the worst loving thing I've done regarding filename encoding.

You'd think because it should be some kind of Unicode on both ends, it'd work well, but no. Somehow the multibyte encodings got mangled into codepage 437, the characters of which Windows promptly directly translated to what I *think* was UCS-2, but I'm not sure (eg. the UTF-8 sequence byte 0xB6 was interpreted as CP437, and became U+2562).

There are probably a number of things that Windows will happily munge in this way.
Windows encodings are their own little hellhole.

LOOK I AM A TURTLE
May 22, 2003

"I'm actually a tortoise."
Grimey Drawer
At my old work we had some weird issue with Windows filenames that I don't think we ever totally figured out. Minidump files with long, seemingly randomly generated names were showing up in the client installation directories. Sometimes those filenames would contain line breaks and other unprintable characters, and I think maybe even null bytes, but it was hard to tell since a lot of tools that deal with the file system would fail miserably when they encountered these files. The reason we discovered that this was happening to begin with was that we were getting errors from log4net, which turned out to be because it was running a System.IO.FileSystemWatcher (a .NET library class) on the directory, and that would throw an unhandled exception when the directory contained a file with one of these murderous names.

You're not supposed to be able to create files with names like that on Windows, but somehow it was happening. I had a theory that it was caused by something like a buffer overflow somewhere, and that the filenames were random bits of memory. I think we may have eventually determined that they came from some Oracle DBMS component we were depending on, but I'm not 100% sure. I'd love to know how it could've happened.

Ralith
Jan 12, 2011

I see a ship in the harbor
I can and shall obey
But if it wasn't for your misfortune
I'd be a heavenly person today
Just as Linux filenames are byte strings with a few things excluded, IIRC Windows filenames are 16-bit units with a few things excluded. It's easy to imagine that you can interpret them as UTF-16 or whatever, but they can contain fun things like lone surrogates, so you're best off treating them as opaque.
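
To make the lone surrogate point concrete, here is a Python sketch. It works on an in-memory string rather than a real Windows filename, and the sample name is invented.

```python
name = "a\ud800b"   # a high surrogate with no low surrogate following it

# Strict UTF-16 refuses to encode the unpaired surrogate.
try:
    name.encode("utf-16-le")
    strict_encodes = True
except UnicodeEncodeError:
    strict_encodes = False

# surrogatepass writes the raw 16-bit unit anyway.
units = name.encode("utf-16-le", "surrogatepass")

# Strict UTF-16 also refuses to decode the resulting units.
try:
    units.decode("utf-16-le")
    strict_decodes = True
except UnicodeDecodeError:
    strict_decodes = False

# Only a lenient decode gets the original sequence back.
restored = units.decode("utf-16-le", "surrogatepass")

print(strict_encodes, strict_decodes, restored == name)
```

Any layer that insists on well-formed UTF-16 will reject or alter such a name, which is the practical argument for treating the units as opaque.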

Carbon dioxide
Oct 9, 2012

Oh that reminds me, I ran into a filename problem when implementing a file upload page on a website that runs in Tomcat.

While Firefox and Chrome put the file and the filename in the request as expected, Internet Explorer/Edge gave the full path as the filename. So, if I uploaded a file in Firefox, I got the filename meme.jpg, but in IE/Edge the filename would be C:\My Documents\No porn here\images\meme.jpg.

What the hell, Microsoft? So, my first idea was using some File utilities library that was on the classpath already and use some method it had to strip the path and return the clean filename. That worked well on my local Windows machine... but once the site was running on the server, nope. Turns out that that method used the Java system property of the OS it was running on, the server is a linux machine, and when running on Linux, the method didn't recognize backslash as a valid path separator character. I could force it to use Windows paths but I was afraid that could cause other problems for users on other OS's.

In the end I decided to go for the very simple solution of "strip everything before any backslash in the filename". In the extremely rare case that someone uploads something with a filename that has an actual backslash in it, it's stripped but who cares.
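
The same trap can be shown in Python, whose standard library ships both sets of path rules on every platform. This is only an illustration of the behaviour, not the Java library from the post; `strip_client_path` is a hypothetical helper and the path is shortened from the example above.

```python
import ntpath
import posixpath

uploaded = r"C:\My Documents\images\meme.jpg"

# POSIX rules only know '/', so the whole string looks like a single filename.
posix_result = posixpath.basename(uploaded)

# Windows rules split on both '\' and '/'.
nt_result = ntpath.basename(uploaded)

def strip_client_path(name: str) -> str:
    """Hypothetical helper: keep whatever follows the last backslash or slash."""
    return name.replace("\\", "/").rsplit("/", 1)[-1]

print(posix_result)
print(nt_result)
print(strip_client_path(uploaded))
```

Like the fix described in the post, `strip_client_path` will shorten a Unix filename that really does contain a backslash.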

Impotence
Nov 8, 2010
Lipstick Apathy
I thought all browser uploads got something roughly equivalent to C:\fakepath?

ErIog
Jul 11, 2001

:nsacloud:

Carbon dioxide posted:

Oh that reminds me, I ran into a filename problem when implementing a file upload page on a website that runs in Tomcat.

While Firefox and Chrome put the file and the filename in the request as expected, Internet Explorer/Edge gave the full path as the filename. So, if I uploaded a file in Firefox, I got the filename meme.jpg, but in IE/Edge the filename would be C:\My Documents\No porn here\images\meme.jpg.

What the hell, Microsoft? So, my first idea was using some File utilities library that was on the classpath already and use some method it had to strip the path and return the clean filename. That worked well on my local Windows machine... but once the site was running on the server, nope. Turns out that that method used the Java system property of the OS it was running on, the server is a linux machine, and when running on Linux, the method didn't recognize backslash as a valid path separator character. I could force it to use Windows paths but I was afraid that could cause other problems for users on other OS's.

In the end I decided to go for the very simple solution of "strip everything before any backslash in the filename". In the extremely rare case that someone uploads something with a filename that has an actual backslash in it, it's stripped but who cares.

Why are you allowing user-input to be reflected on the disk at all? The filename of whatever's uploaded should go in some structured data store where the file name is some kind of hash.

The filename itself should be escaped/sanitized and kept in a DB as metadata.

I understand this is probably some kind of internal tool, though, and you might not be able to call all the shots to do things properly.
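
A rough Python sketch of that layout, assuming SHA-256 as the hash and a plain dict standing in for the database. `store_upload` and the field names are made up for illustration.

```python
import hashlib
import tempfile
from pathlib import Path

def store_upload(data: bytes, client_name: str, root: Path, index: dict) -> str:
    """Write the bytes under a content hash; keep the client's name as metadata only."""
    digest = hashlib.sha256(data).hexdigest()
    (root / digest).write_bytes(data)          # the on-disk name never comes from the user
    index[digest] = {"client_name": client_name, "size": len(data)}
    return digest

# Demo in a throwaway directory with a deliberately hostile client filename.
with tempfile.TemporaryDirectory() as tmp:
    index = {}
    key = store_upload(b"hello", "..\\..\\evil.jpg", Path(tmp), index)
    stored_ok = (Path(tmp) / key).read_bytes() == b"hello"

print(key, stored_ok, index[key]["client_name"])
```

The client-supplied name still needs escaping wherever it is displayed, but it can no longer influence where the bytes land.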

ErIog fucked around with this message at 08:15 on Nov 28, 2016

hackbunny
Jul 22, 2007

I haven't been on SA for years but the person who gave me my previous av as a joke felt guilty for doing so and decided to get me a non-shitty av

VikingofRock posted:

This sounds really interesting. For example, I thought on Linux filenames were basically just any sequence of bytes not containing nulls or '/', with the total path length kept under PATH_MAX bytes if you don't want things to start breaking. Is it significantly more complicated than that?

Alright. It's a mess. Windows filenames are, nominally, UTF-16. This is not true: invalid UTF-16 strings (e.g. mismatched surrogates) are valid filenames. It's not UCS-2 either: reserved or unassigned codepoints are fine. This is extremely important if you have to convert a filename to a non-opaque representation: there are filenames that just plain won't convert to Unicode (not UCS-2, not UTF-16, not UTF-8: Unicode, in general) without loss. You can create files that are completely inaccessible to the .NET runtime, for example, because .NET converts all filenames to Unicode and back, and this conversion is lossy.

Windows filenames are case insensitive. Windows does not use Unicode case insensitivity: it has its own internal tables. Also, unlike most implementations of case insensitivity, Windows doesn't use lowercase folding, but uppercase. Lowercase-uppercase roundtrips are lossy, too, and locale-sensitive: most notably, in the Turkish locale, unlike most others, i uppercases to İ, and I lowercases to ı, i.e. the dot on the I is not just decoration but distinguishes between two different vowels. Case folding has surprising properties in general, for example lowercase ligatures like ß and ﬁ turn into two characters when uppercased (SS and FI, respectively), changing the length of the string (this may or may not result in different case insensitive sort orders, depending on the character in question, the locale, etc.). Windows case insensitivity of filenames sidesteps most of these issues by being locale-insensitive. NTFS volumes introduce a complication, in that each volume technically has its own private uppercase table, but I think (hope!) in practice the driver always uses the running kernel's table. So, if you are on Windows and you are comparing filenames with stricmp, you aren't getting the same results the operating system does. Use RtlCompareUnicodeString instead: it's undocumented for use in user mode, but who cares? You're really supposed to use CompareString with the right magic combination of flags, but who has time for that?
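
The general case-mapping hazards can be seen from Python, with the caveat that Python's `str.upper` and `str.lower` use Unicode's default, locale-independent mappings. This says nothing about Windows' own uppercase table; it only shows why naive case conversion is lossy.

```python
sharp_s = "\u00df"            # ß
fi_ligature = "\ufb01"        # ﬁ (single codepoint ligature)
dotless_i = "\u0131"          # ı, Turkish dotless i
dotted_capital_i = "\u0130"   # İ, Turkish dotted capital I

# Uppercasing can change the length of a string.
print(sharp_s.upper(), len(sharp_s.upper()))
print(fi_ligature.upper(), len(fi_ligature.upper()))

# An upper/lower round trip does not get the dotless i back.
round_trip = dotless_i.upper().lower()
print(round_trip == dotless_i)

# Lowercasing the dotted capital I yields 'i' plus a combining dot above.
lowered = dotted_capital_i.lower()
print([f"U+{ord(c):04X}" for c in lowered])
```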

Windows architecture is deeply layered, and so far we've only looked at the Object Manager (Ob) layer, which is one of the main user/kernel interfaces: opening and creating files, managing the registry, and enumerating devices are only some of the many features accessed through the Object Manager. To finish our look at the Object Manager: "filenames" (really object names) can contain any character except for backslash, and no string (be it an object name or a path) can be longer than 64 KB (and yes it's measured in bytes, not characters).

Before a file create/open operation reaches the filesystem, first it goes through the Object Manager, and then usually through the filesystem runtime library (FsRtl). FsRtl imposes a few more rules on filenames: no ASCII control characters, no ", *, /, <, >, ?, \ or |. FAT further limits the set of legal characters, I forget exactly how, but I remember that it makes : illegal (which is legal in NTFS because it's used in the stream/type syntax). I say usually because some internal, non-disk filesystems don't use FsRtl and will take literally any name. An example is the named pipes filesystem: named pipes are files in the root directory of this filesystem, and their names can even contain backslash, because the named pipes filesystem doesn't have subdirectories (fun fact: anonymous pipes are, in fact, named pipes with unique names)
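
The FsRtl rules as listed here fit in a few lines of Python. This is a sketch built only from the list above, not from the real FsRtl code; `passes_fsrtl_rules` is a hypothetical name, and "ASCII control characters" is assumed to mean code units below 0x20.

```python
# Characters the post lists as rejected by FsRtl.
FSRTL_FORBIDDEN = set('"*/<>?\\|')

def passes_fsrtl_rules(name: str) -> bool:
    """Return True if no character is a control character or in the forbidden set."""
    return not any(ord(ch) < 0x20 or ch in FSRTL_FORBIDDEN for ch in name)

for candidate in ["report.txt", "file:stream", "what?.txt", "line\nbreak"]:
    print(repr(candidate), passes_fsrtl_rules(candidate))
```

Note that the colon passes this check, consistent with it being legal on NTFS and only rejected further down by FAT.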

After a filename has gone through Ob and FsRtl, it's passed to the filesystem driver, which complicates matters even further. Leaving aside network filesystems for now, where anything can happen, let's have a look at boring old NTFS. NTFS parses its filenames, they aren't just plain strings. If a filename contains a colon, then the part before the colon is the unique filename, and the part after is the stream name (aside from the default, anonymous stream, NTFS files can have multiple named streams. What a spectacularly bad idea). If a filename contains two colons, then the stream name is the string between the two colons, and what comes after the second colon is the stream type (the only legal stream type I can think of right now is $DATA, which is the file's data and the default type if omitted). file and file::$DATA are, therefore, the same file, and good luck finding that out from the filename alone
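
The name:stream:type syntax can be sketched as a tiny parser. `split_ntfs_name` is a hypothetical helper that follows only the rules described above; it ignores case insensitivity and anything else the real driver does.

```python
def split_ntfs_name(name: str) -> tuple[str, str, str]:
    """Split 'name[:stream[:type]]' into (file name, stream name, stream type)."""
    parts = name.split(":")
    if len(parts) > 3:
        raise ValueError("too many colons for name:stream:type")
    file_name = parts[0]
    stream = parts[1] if len(parts) > 1 else ""            # empty means the default stream
    stream_type = parts[2] if len(parts) > 2 else "$DATA"  # default type if omitted
    return file_name, stream, stream_type

print(split_ntfs_name("file"))
print(split_ntfs_name("file::$DATA"))
print(split_ntfs_name("file:notes"))
```

The first two calls return the same tuple, which is the "same file, different spelling" problem in miniature.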

FAT is a cranky old filesystem dating back to QDOS and full of quirks. FAT filenames are 1-8 characters long, with an optional extension of 1-3 characters, stored in uppercase in an unspecified, machine-specific ASCII-based encoding ("OEM"), usually the same encoding used by the VGA BIOS (!)... or what the VGA BIOS would use if the machine had one. Volumes supporting the LFN (Long Filename) extension, now the norm, can give a file a second, user-friendlier name of up to 255 characters, in the usual opaque 16-bit Windows encoding that's not quite UCS-2. By parsing two strings, one compliant with the 8.3 format and one not, it's impossible to tell if they identify the same file in a FAT directory, because the two names of a FAT file can be completely unrelated. I also think that, to the FAT driver, "file" and "file." are the same file, because they both are 8.3 names with an empty extension (the . is not stored but implied). Even if there wasn't FsRtl in front of FAT, even if it didn't parse filenames to split them into name and extension, FAT reserves at least one character. You see, you can't delete files from a FAT volume: what you actually do is hide the file and mark it as reusable by replacing the first letter of its name with a control character (I forget which)
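
Why "file" and "file." collapse to the same thing is easier to see once the name is packed the way an 8.3 directory entry stores it: 8 bytes of name and 3 of extension, space padded, with no dot. The sketch below assumes plain ASCII instead of a real OEM code page, and `to_fat_short_name` is a made-up helper.

```python
def to_fat_short_name(name: str) -> bytes:
    """Pack an 8.3 name into the 11-byte name field of a FAT directory entry."""
    base, _, ext = name.partition(".")
    if not 1 <= len(base) <= 8 or len(ext) > 3 or "." in ext:
        raise ValueError("not an 8.3 name")
    # Uppercase, space-pad both parts, and drop the dot: it is implied, not stored.
    return (base.upper().ljust(8) + ext.upper().ljust(3)).encode("ascii")

print(to_fat_short_name("file"))
print(to_fat_short_name("file."))
print(to_fat_short_name("readme.txt"))
```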

If it wasn't complicated enough, applications don't (generally) use the Ob layer directly, but they use it through the Win32 layer. Win32 adds a whole heap of complications to filenames, by introducing four distinct syntaxes for absolute paths and a ton of special cases. The four kinds of absolute Win32 paths are:
  • drive path: X:\.... Length limited to 260 characters, . and .. entries are collapsed, forward slashes converted to backslashes, multiple backslashes collapsed into one, trailing dots in entries are stripped. This path has the worst quirk of all: if any entry, without extension, matches a DOS device name (LPTx, COMx, AUX, NUL, PRN...), then the whole path is thrown away and the DOS device is used instead. If a drive path survives the gauntlet of transformations, it's converted to an Ob path of the form \??\X:\... (or \??\DOS device, if it resolved to a DOS device, e.g. \??\COM1). Amusingly, the drive letter doesn't have to be a letter at all, any single not-quite-UCS-2 character will do, even the colon itself
  • UNC path: \\host\share\.... Like a drive path, but rooted at \\host\share instead of X:. Converts to an Ob path of the form \??\UNC\host\share\...
  • device path: \\.\device\.... Converts to \??\device\... with no other transformation performed, because a device's namespace is completely arbitrary (see named pipes above), but still limited to 260 characters. Note that drive letters, too, are device names, and the \\.\X:\... escape syntax can let you access filenames that are otherwise inaccessible (the equivalent for UNC paths is, of course, \\.\UNC\host\share\... - yes, "UNC" is a device, called the redirector)
  • Ob escape: \\?\.... Directly converts to \??\.... Literally all that's done is replacing the second \ with a ?. Gives almost full access to the Ob namespace; filenames with embedded NUL characters are still inaccessible, because Win32 uses NUL-terminated strings. Can't be used everywhere because some functions don't support it
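
The four syntaxes in the list above can be told apart by prefix alone. The Python sketch below models only the prefix rewrite into an Ob path; it deliberately skips normalization, the 260 character limit, and the DOS device special case. `win32_to_ob` is a hypothetical name, not a real API.

```python
def win32_to_ob(path: str) -> tuple[str, str]:
    """Classify an absolute Win32 path and apply only the prefix rewrite."""
    if path.startswith("\\\\?\\"):          # Ob escape: second backslash becomes '?'
        return "ob escape", "\\??\\" + path[4:]
    if path.startswith("\\\\.\\"):          # device path
        return "device", "\\??\\" + path[4:]
    if path.startswith("\\\\"):             # UNC path, rooted at the UNC device
        return "unc", "\\??\\UNC\\" + path[2:]
    if len(path) >= 3 and path[1:3] == ":\\":   # drive path; any single character may precede the colon
        return "drive", "\\??\\" + path
    raise ValueError("not one of the four absolute forms")

for sample in [r"C:\Windows\notepad.exe", r"\\host\share\f.txt",
               r"\\.\pipe\demo", r"\\?\C:\x"]:
    print(sample, "->", win32_to_ob(sample))
```

The order of the checks matters: the two escape prefixes have to be tested before the generic double backslash, or they would be misread as UNC paths.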

Applications often use Win32 through yet another layer provided by their framework or some library, which introduces further complications. For example, as I've already said, filenames that don't convert to Unicode are completely inaccessible to .NET. And this isn't even getting into showing paths to the user, or even worse deriving paths from user input, or from manipulating existing paths

To top it off, font issues add that little bit of spice: Japanese fonts show the backslash character as a Yen sign. It works as a backslash in all aspects, except the user believes it's a Yen sign. I believe it's visually indistinguishable from the real Yen sign character, which must be fun to handle

I know other operating systems much less than Windows, but they're messy too. Never as messy as Windows, but still not foolproof

Linux, much like Windows, internally treats all paths as opaque, only making a special case for the path separator (here / instead of \), and the NUL character, which isn't legal in Linux. As we've seen, the kernel namespace is only one of the many, many layers involved: what about the filesystems themselves? The Linux FAT filesystem driver, for example, doesn't just return LFN filenames as raw binary dumps of the not-quite-UCS-2 strings (not that it could - they very often contain embedded NUL bytes), because they would be useless to users

Darwin requires legal Unicode strings for filenames, and enforces the Unicode normal form NFD. Not only does this fail to solve the issue of what to do with filenames that can't be converted to Unicode, it also changes the binary representation of filenames: if you create a file named "ò" (one codepoint, two bytes in UTF-8), it's actually created as "ò" (two codepoints, three bytes in UTF-8), because NFD splits precomposed characters into their base form plus their combining marks. This is almost invisible to the user (indeed Darwin probably has the sanest way to deal with filenames, and as a side effect incentivizes user interfaces to properly split strings into grapheme clusters), but it can trip up programs that don't know how paths are actually compared under Darwin. In particular, this makes Linus Torvalds extremely angry. Oh and paths, in Darwin, may or may not be case-insensitive, and don't ask me what folding algorithm is used because I don't know. Fun!
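
The byte-level effect of NFD is easy to reproduce with Python's `unicodedata`. This uses the standard Unicode normalization, which may differ in detail from whatever Darwin applies internally.

```python
import unicodedata

precomposed = "\u00f2"                                   # ò as a single codepoint
decomposed = unicodedata.normalize("NFD", precomposed)   # 'o' + COMBINING GRAVE ACCENT

print(len(precomposed), len(precomposed.encode("utf-8")))
print(len(decomposed), len(decomposed.encode("utf-8")))

# A byte-wise comparison sees two different names...
same_bytes = precomposed.encode("utf-8") == decomposed.encode("utf-8")
# ...while normalizing both sides first makes them compare equal.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

print(same_bytes, same_after_nfc)
```

A program that stores the name it asked for and later compares it byte for byte against a directory listing is exactly the kind that gets tripped up.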

Filenames are opaque data. There's no way around it. Converting them to strings is lossy, converting them back is lossy too

hackbunny fucked around with this message at 16:00 on Nov 28, 2016

Carbon dioxide
Oct 9, 2012

ErIog posted:

Why are you allowing user-input to be reflected on the disk at all? The filename of whatever's uploaded should go in some structured data store where the file name is some kind of hash.

The filename itself should be escaped/sanitized and kept in a DB as metadata.

I understand this is probably some kind of internal tool, though, and you might not be able to call all the shots to do things properly.

Yeah it is. I only deal with the web side of things. In this case that means a basic validation on uploaded files and then forwarding them to a service that does more validation and talks to a database system that's not under my control. I also have to show the user a list of the names of files they have uploaded (gotten from the db service) without having weird filenames break the page. The rest is not in my hands.

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell


We're all doomed.

Ghost of Reagan Past
Oct 7, 2003

rock and roll fun
It is a miracle that anything works at all, apparently.

1337JiveTurkey
Feb 17, 2005

I'm pretty sure that the character that FAT uses to indicate that a file is deleted is å in the western codepage. There's a different control character which is used to indicate that you really want a file which begins with å because that makes sense. Also the original filesystem FAT was based on used six character file names with a three character extension, with each character only being six bits. This was because it was running on a 36 bit computer and that makes a file name a full word and the extension a half word.
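
A sketch of the two markers in Python, assuming the byte values from the FAT on-disk format as I understand it (0xE5 for a deleted entry, 0x05 standing in for a genuine leading 0xE5) and Latin-1 as the "western codepage". The helper names are invented.

```python
DELETED_MARK = 0xE5   # first name byte of a deleted directory entry
ESCAPED_E5 = 0x05     # stored in place of a real leading 0xE5

def mark_deleted(short_name: bytes) -> bytes:
    """Overwrite the first byte of an 11-byte short name with the deleted marker."""
    return bytes([DELETED_MARK]) + short_name[1:]

def encode_first_byte(short_name: bytes) -> bytes:
    """Swap a genuine leading 0xE5 for 0x05 so the entry is not taken for deleted."""
    if short_name[0] == DELETED_MARK:
        return bytes([ESCAPED_E5]) + short_name[1:]
    return short_name

def is_deleted(stored_name: bytes) -> bool:
    return stored_name[0] == DELETED_MARK

print(bytes([DELETED_MARK]).decode("latin-1"))
print(mark_deleted(b"README  TXT"))
print(is_deleted(encode_first_byte(b"\xe5NGSTROMTXT")))
```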

xtal
Jan 9, 2011

by Fluffdaddy
Not posting the easily-identifiable code here but I'm reviewing some code where they iterate over a hash instead of indexing it conventionally. Every lookup is the same 6-line combination of for, variable assignment and break
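
For anyone who hasn't seen the pattern, a generic Python rendition. This is not the code under review; the table and function names are made up.

```python
prices = {"apple": 3, "pear": 5, "plum": 7}

def lookup_by_scanning(table: dict, wanted):
    # The pattern described above: walk every key just to find one.
    result = None
    for key in table:
        if key == wanted:
            result = table[key]
            break
    return result

def lookup_directly(table: dict, wanted):
    # What the hash table is for: a single keyed lookup.
    return table.get(wanted)

print(lookup_by_scanning(prices, "pear"), lookup_directly(prices, "pear"))
```

Both return the same answer; the scanning version just turns a constant-time lookup into a linear one and six lines.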

Space Kablooey
May 6, 2009


Not trying to be snarky, but where do you guys work where FAT is still a thing?

1337JiveTurkey
Feb 17, 2005

HardDiskD posted:

Not trying to be snarky, but where do you guys work where FAT is still a thing?

USB sticks are usually FAT. Really it doesn't matter since you should just treat filenames as opaque strings which are meaningful to the user.

Linear Zoetrope
Nov 28, 2011

A hero must cook
Computers were a mistake

Xarn
Jun 26, 2015

Ghost of Reagan Past posted:

It is a miracle that anything works at all, apparently.
You didn't already know that? :v:


xtal posted:

Not posting the easily-identifiable code here but I'm reviewing some code where they iterate over a hash instead of indexing it conventionally. Every lookup is the same 6-line combination of for, variable assignment and break

Boooooring. In my previous job, we even wrote a tool that found those in the code we got from outsourced developers. :suicide:

pseudorandom name
May 6, 2007

There's an entire Project Zero post on Windows path name resolution, in case y'all want more detail.

hackbunny
Jul 22, 2007

I haven't been on SA for years but the person who gave me my previous av as a joke felt guilty for doing so and decided to get me a non-shitty av

pseudorandom name posted:

There's an entire Project Zero post on Windows path name resolution, in case y'all want more detail.

Haha I keep forgetting that \??\ paths are legal too. What a singularly bad idea. But AFAIK the \??\GLOBALROOT symlink is only accessible from privileged processes (which ironically makes privileged processes that take user input easier to subvert...)

SupSuper
Apr 8, 2009

At the Heart of the city is an Alien horror, so vile and so powerful that not even death can claim it.
And you better hope nobody's been playing around with NTFS features.

JewKiller 3000
Nov 28, 2006

by Lowtax

1337JiveTurkey posted:

USB sticks are usually FAT. Really it doesn't matter since you should just treat filenames as opaque strings which are meaningful to the user.

*puts a \0 in your "opaque" string*

ulmont
Sep 15, 2010

IF I EVER MISS VOTING IN AN ELECTION (EVEN AMERICAN IDOL) ,OR HAVE UNPAID PARKING TICKETS, PLEASE TAKE AWAY MY FRANCHISE

JewKiller 3000 posted:

*puts a \0 in your "opaque" string*

Works out if you're in the NT kernel, since it uses counted strings per the linked article.

hobbesmaster
Jan 28, 2008

:dogbutton:

maybe I'll just use inode indexes from now on

hyphz
Aug 5, 2003

Number 1 Nerd Tear Farmer 2022.

Keep it up, champ.

Also you're a skeleton warrior now. Kree.
Unlockable Ben

Jsor posted:

Computers were a mistake

A year or two back I got assigned to teach OS Fundamentals and it was the most depressing thing ever because of this kind of thing. "Here's something you thought was simple. Now here's the ridiculous complexity behind it. Now here's the actual implementations which are full of cruft. Now here's the workarounds for rear end in a top hat programmers redlining the OS."

Edison was a dick
Apr 3, 2010

direct current :roboluv: only

hyphz posted:

A year or two back I got assigned to teach OS Fundamentals and it was the most depressing thing ever because of this kind of thing. "Here's something you thought was simple. Now here's the ridiculous complexity behind it. Now here's the actual implementations which are full of cruft. Now here's the workarounds for rear end in a top hat programmers redlining the OS."

I know the feeling. I spent the last 4 months (of half an evening a week) writing about all the complexities in moving a file in Linux when you have to be able to handle the possibility of the destination not existing.

xtal
Jan 9, 2011

by Fluffdaddy
C is like the JavaScript of low-level. So glad Rust obsoleted it

TooMuchAbstraction
Oct 14, 2012

I spent four years making
Waves of Steel
Hell yes I'm going to turn my avatar into an ad for it.
Fun Shoe

Optional curly brackets are and have always been a coding horror. Saving yourself two keystrokes and one line (or two lines if you put { on its own line) is not worth the sheer amount of headaches induced by a construct that effectively imparts special meaning to the next semicolon.

xtal
Jan 9, 2011

by Fluffdaddy
If null was the billion dollar mistake C was the trillion dollar mistake

JawnV6
Jul 4, 2004

So hot ...
Source your quotes.

Space Kablooey
May 6, 2009



Why can't you define a function instead of using macros? :psyduck:

sarehu
Apr 20, 2007

(call/cc call/cc)

HardDiskD posted:

Why can't you define a function instead of using macros? :psyduck:

In C sometimes it's because you can't write templated functions or overloaded functions, but a macro will suffice.

Edison was a dick
Apr 3, 2010

direct current :roboluv: only

HardDiskD posted:

Why can't you define a function instead of using macros? :psyduck:

Type dynamism? The macro would work for anything that has a state field, and it's one reason for macros in the libc.

Conciseness? There's some argument for macros that are simple expressions, though if you need to wrap it in do-while I argue it can't be for this reason.

In my experience it's mostly because of folk knowledge on how effective inlining is.

1337JiveTurkey
Feb 17, 2005

Edison was a dick posted:

In my experience it's mostly because of folk knowledge on how effective inlining is.

Also saving time typing register so drat much.

vOv
Feb 8, 2014

sarehu posted:

In C sometimes it's because you can't write templated functions or overloaded functions, but a macro will suffice.

Or you want to return from the function containing the macro.


Fergus Mac Roich
Nov 5, 2008

Soiled Meat
http://research.swtch.com/shmacro
