Jabor
Jul 16, 2010

#1 Loser at SpaceChem
How are people connecting to these devices? By IP address? What sort of attacks are you trying to protect against by enabling HTTPS?


Trapick
Apr 17, 2006

fankey posted:

Not exactly a programming question but figured this might be the best place to ask. We sell a Linux 'appliance'. This device has an embedded web server for configuration. We'd like to support HTTPS out of the box but I'm not sure the best way (if even possible) to ship with a usable certificate. Options I've come up with:

  • Self signed. Browsers freak out which doesn't give users a good feeling
  • Individual signed cert. Given that we could be selling 10k+ units a year I'm not sure this is even feasible. These devices won't have valid domain names so Let's Encrypt is out.
  • Ship every device with the same signed cert. Not sure if this is considered good practice or not.
  • Let the user install their own cert and use HTTP until they do. Given our user base this really means that no one will be using HTTPS.
Is this appliance meant to be accessible only over the local private network? I don't think you can individually sign certs (with a valid CA at least) if they'd be accessed by an internal IP address. Same with shipping every device with the same signed cert - what would your Common Name be?

Fake Edit: see https://www.globalsign.com/en/blog/certificates-for-internal-servers/ ; doesn't look like there are great options for this.

Edit: except maybe https://www.entrust.com/private-ssl-certificates/ ? Not super clear, I'm pretty sure you'd need DNS entries on the local network to make that work, and at that stage you could just have the user install their own cert.

Trapick fucked around with this message at 19:16 on Oct 7, 2016

baka kaba
Jul 19, 2003

PLEASE ASK ME, THE SELF-PROFESSED NO #1 PAUL CATTERMOLE FAN IN THE SOMETHING AWFUL S-CLUB 7 MEGATHREAD, TO NAME A SINGLE SONG BY HIS EXCELLENT NU-METAL SIDE PROJECT, SKUA, AND IF I CAN'T PLEASE TELL ME TO
EAT SHIT

Anyone got any Rx experience? I'm trying to get something working with a chain of observables, where each one combines several inputs and feeds into the next (which also has other inputs), so any new events propagate down to the last observable. The trouble is some of these inputs appear in multiple places in the sequence, and I want things to propagate in order - if an observable is an input to the first observable and the last, I want the first observable to react earlier, emit its result, and when the last observable reacts to the new event, it'll have the updated result from the previous observable too.



So B is an input to C and E - I'm looking at using CombineLatest so each stage uses the most current values of each input. I want C to react to B's new value before E does, so that C updates and has a new value for E to work with, along with the new B. If E reacts first, it works with the new B and the old C, and then runs again when the new C data comes in.

What's the best way to approach this? Should I be breaking E's direct reliance on B, and making it dynamically subscribe to B for a single value every time? That feels really bad, and it obscures the natural dependency E has on B, purely because an earlier dependency also depends on it. I just want to impose an order, and make sure each stage completes its observing work before the next one starts.

I'm still pretty new to this and I might be trying to crowbar Rx into a situation it's not suited for, but it seems like there should be a solution where you have intermingled dependencies like this. Or are you meant to just let them reprocess and filter out multiple final results with DistinctUntilChanged?
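
For anyone picturing the problem, here's a minimal hand-rolled sketch of the diamond-shaped dependency (plain Python, not real Rx; `Cell`, `update_c`, and `update_e` are invented for illustration). With the naive subscription order, E fires once with the stale C and again when the fresh C arrives - the glitch being described:

```python
# A tiny push-based stand-in for an observable (illustration only, not Rx).
class Cell:
    def __init__(self, value=None):
        self.value = value
        self.subscribers = []

    def subscribe(self, fn):
        self.subscribers.append(fn)

    def set(self, value):
        self.value = value
        for fn in list(self.subscribers):
            fn(value)

log = []
B, C = Cell(1), Cell(None)

def update_c(b):          # C = B * 10
    C.set(b * 10)

def update_e(_):          # E depends on both B and C
    log.append((B.value, C.value))

# E subscribes to B *before* C has seen the new value: the glitch.
B.subscribe(update_e)
B.subscribe(update_c)
C.subscribe(update_e)

update_c(B.value)         # prime C from B's current value
log.clear()

B.set(2)
print(log)                # [(2, 10), (2, 20)] -- the first pairing is stale
```

Swapping the subscription order (so `update_c` runs before `update_e`) makes E see the fresh C both times - (2, 20) twice - which is why imposing a topological order plus DistinctUntilChanged-style deduplication is a common answer.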

Bob Morales
Aug 18, 2006


Just wear the fucking mask, Bob

I don't care how many people I probably infected with COVID-19 while refusing to wear a mask, my comfort is far more important than the health and safety of everyone around me!

fankey posted:

Not exactly a programming question but figured this might be the best place to ask. We sell a Linux 'appliance'. This device has an embedded web server for configuration. We'd like to support HTTPS out of the box but I'm not sure the best way (if even possible) to ship with a usable certificate. Options I've come up with:

  • Self signed. Browsers freak out which doesn't give users a good feeling

BIG_FIREWALL_COMPANY does this for their devices

mystes
May 31, 2006

fankey posted:

Not exactly a programming question but figured this might be the best place to ask. We sell a Linux 'appliance'. This device has an embedded web server for configuration. We'd like to support HTTPS out of the box but I'm not sure the best way (if even possible) to ship with a usable certificate. Options I've come up with:

  • Self signed. Browsers freak out which doesn't give users a good feeling
  • Individual signed cert. Given that we could be selling 10k+ units a year I'm not sure this is even feasible. These devices won't have valid domain names so Let's Encrypt is out.
  • Ship every device with the same signed cert. Not sure if this is considered good practice or not.
  • Let the user install their own cert and use HTTP until they do. Given our user base this really means that no one will be using HTTPS.
Aside from a self-signed certificate (which will add no security), none of these will even work. If you really want to somehow get SSL to work out of the box, the ONLY option would be to do so using a domain name that you control and running a proxy that connects to the appliance, but this clearly will cause more problems than it will solve. Just use HTTP.

Gravity Pike
Feb 8, 2009

I find this discussion incredibly bland and disinteresting.

mystes posted:

Aside from a self-signed certificate (which will add no security), none of these will even work. If you really want to somehow get SSL to work out of the box, the ONLY option would be to do so using a domain name that you control and running a proxy that connects to the appliance, but this clearly will cause more problems than it will solve. Just use HTTP.

A self-signed cert will add encryption, but not authentication. It'll stop people on the LAN from sniffing passwords, which isn't nothing.

ShoulderDaemon
Oct 9, 2003
support goon fund
Taco Defender

Gravity Pike posted:

A self-signed cert will add encryption, but not authentication. It'll stop people on the LAN from sniffing passwords, which isn't nothing.

SSL/TLS actually supports a mode without any certificate at all specifically for this reason - encryption without authentication is still better than nothing at all. Sadly, literally every web browser explicitly disables this mode. I often wish that they allowed it and just didn't show the lock icon at all or something.

Fergus Mac Roich
Nov 5, 2008

Soiled Meat

ShoulderDaemon posted:

SSL/TLS actually supports a mode without any certificate at all specifically for this reason - encryption without authentication is still better than nothing at all. Sadly, literally every web browser explicitly disables this mode. I often wish that they allowed it and just didn't show the lock icon at all or something.

It should show a house of cards instead.

Nippashish
Nov 2, 2005

Let me see you dance!

ShoulderDaemon posted:

SSL/TLS actually supports a mode without any certificate at all specifically for this reason - encryption without authentication is still better than nothing at all. Sadly, literally every web browser explicitly disables this mode. I often wish that they allowed it and just didn't show the lock icon at all or something.

What bizarro rationale could they possibly have for disabling this?

ShoulderDaemon
Oct 9, 2003
support goon fund
Taco Defender

Nippashish posted:

What bizarro rationale could they possibly have for disabling this?

In short, there's a downgrade attack possible on human operators.

If you support unauthenticated traffic on the same kind of connection as authenticated traffic, then when you try to connect to a server that advertises authentication, an adversary could MITM the connection, remove the authentication by reencryption, and present you with modified data. The browser would, presumably, correctly mark the data it was seeing as "insecure", but the (somewhat legitimate) argument is that browsers aren't really the things making policy decisions, humans are, and human operators seem to be really bad on the whole at making good decisions involving whether or not to trust websites.

Fundamentally, this is also why modern browsers have been getting more and more annoying about viewing websites with self-signed certificates - it's the same problem. If you make the warning about a self-signed certificate not very annoying, then a large population of users will just ignore the warning entirely, allowing MITM attacks which strip all meaningful security by simply replacing a legitimate certificate with a self-signed certificate owned by the adversary.

Note that this problem doesn't occur in other protocols which use SSL/TLS such as SMTP. In SMTP you can upgrade connections to encrypted/authenticated after the connection has been established by sending a "STARTTLS" command - an adversarial MITM could trivially intercept this command and simply respond "sorry not supported" to whichever peer sent it. But policy decisions involving SMTP such as "don't send email to a connection I haven't authenticated" are performed by software which is much more difficult to trick than a human, so it doesn't matter. If your SMTP connection has been MITMed such that it can no longer be authenticated, then your software will know that and will make the correct decision according to your email policy. And sure enough, lots of SMTP daemons and clients support SSL/TLS encryption without authentication, and treat it exactly the correct way (as if it were not meaningfully protected).
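
That software-policy point boils down to a check a MITM can flip but never trick. A sketch (function name and extension set invented for illustration, not any real SMTP library's API):

```python
def may_send_secrets(server_extensions, require_tls=True):
    """Software policy for an SMTP-like client: only proceed if the
    server still advertises STARTTLS. A MITM can strip the extension
    from the EHLO reply, but that just makes the software (correctly)
    refuse - no human judgement is involved anywhere."""
    if "STARTTLS" in server_extensions:
        return True          # upgrade the connection, then send
    return not require_tls   # downgraded connection: obey the policy

print(may_send_secrets({"STARTTLS", "AUTH"}))   # True
print(may_send_secrets({"AUTH"}))               # False -- stripped by a MITM
```

The browser's problem is that the equivalent of `require_tls` lives in the user's head, not in code.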

"Fixing" this for the web is probably impossible at this point; general consensus seems to be that you just can't trust random human operators to look at anything and make a meaningful security decision. The closest thing to a "right way" to fix this that I can think of would be incredibly draconian policies whose ship has long since sailed like "web browsers shouldn't ever submit forms to non-authenticated servers" which would obviously break a very large portion of the existing web. The web is based on a very generic protocol which has no way to identify "sensitive data" and can only make the most trivial kinds of policy decisions, and the interface that websites use to show data is extremely freeform and variable. When you're confronted with typical users being extremely aggressive about bypassing warnings without reading them as a matter of course, this makes a fairly untenable situation, and the browser makers have wound up in this losing war where they are trying to remove enough dangerous features to prevent the most obvious traps from routinely working.

I think if I was designing a web browser now I'd try something like accepting encryption without authentication and self-signed certificates without blinking, but treating them as if there was no security layer. I'd forbid entirely rendering pages with mixed authenticated/non-authenticated content. And as soon as a user started to type into a form on a page in non-authenticated mode, I'd beep and freeze input for 3 seconds and replace the entire page during that time with a giant emoji of a burglar or something. If the user keeps typing after that, then whatever, they've been warned that someone is gonna steal whatever they're trying to type.

And I can already think of at least one way around that. If an adversary MITMed a bank's website, and provided a login page to the user with an "onscreen keyboard" made using JavaScript and some text that said "Please use the provided keyboard for security" I bet 90+% of users would not only go right ahead and give that page their PIN, they'd feel good about doing so. Hell, one of the banks I used in the past had exactly that setup for some godforsaken reason. The only even marginally feasible mechanic I can think of to bypass that would be disabling JavaScript for non-authenticated content, and that ship won't sail.

Nippashish
Nov 2, 2005

Let me see you dance!
That's an annoyingly sensible answer, thanks for explaining.

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

What is the reason so many banks are so seemingly terrible at handling their web sites?

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

ShoulderDaemon posted:

SSL/TLS actually supports a mode without any certificate at all specifically for this reason - encryption without authentication is still better than nothing at all.

It's not. It's actually worse than plaintext, in that it might give users some form of false security.

Steve French
Sep 8, 2003

Suspicious Dish posted:

It's not. It's actually worse than plaintext, in that it might give users some form of false security.

That's making an assumption about how it is presented to the user, is it not? Generally I definitely agree with you, but all else equal in terms of how a connection is represented to a user, you'd still prefer encrypted and unauthenticated over plain text, which is what was just suggested.

ShoulderDaemon
Oct 9, 2003
support goon fund
Taco Defender

Suspicious Dish posted:

It's not. It's actually worse than plaintext, in that it might give users some form of false security.

Eh, I think this is just the downgrade issue I discussed above. I agree that you shouldn't advertise encryption without authentication to the user - it should appear exactly the same as a completely unsecured connection. Given that, I would always prefer to use encryption when possible; it prevents a large number of attacks which are not MITMs, which is to say, it helps. As attacks go, MITMs are a minority. It's problematic to say we should increase our vulnerability surface in order to make it more obvious to human operators, most of whom will not understand either way.

In the case of HTTP that's not good enough because of the concerns you'd get from introducing downgrades into HTTPS, but that's because the web ecosystem is sorta crazy, as I wrote above. It's the same reason HTTPS is a separate port and protocol than HTTP instead of using a STARTTLS-like construct like every other modern protocol does; what little security model we have for web traffic now irrevocably depends on being able to draw a hard line between the secure and insecure web, and anything that breaches that is going to cause users to do the wrong thing. We've engineered ourselves into this situation where untrained human operators are required to make continuous security judgements and allowing a https:// link to point to an unauthenticated resource would severely hamper what little ability they have to do so, and I don't think there's an easy way out.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
What attacks are you potentially talking about? Passive snooping by network operators that can't afford trivial MITMs?

The way I see it, unauthenticated HTTPS provides no extra security over HTTP, so why allow its use at all?

I would say that unauthenticated HTTPS is also dangerous because it allows people like fankey to check off the "uses HTTPS" checkbox and think that they're secure.

MrMoo
Sep 14, 2000

Thermopyle posted:

What is the reason so many banks are so seemingly terrible at handling their web sites?

It's simply any big organization: look at IBM's or Microsoft's sites; they're impressively terrible, and they're tech companies.

ShoulderDaemon
Oct 9, 2003
support goon fund
Taco Defender

Suspicious Dish posted:

What attacks are you potentially talking about? Passive snooping by network operators that can't afford trivial MITMs?

For large networks, the idea of running MITM on every SSL connection becomes prohibitively expensive very quickly. MITM is cheap for directed attacks - those targeting either a small number of users or a small number of sites. Encryption without authentication does a really good job of protecting against, for example, wide-scale keyword searches on all network traffic by an ISP, or long-term collection of all user behavior, or capturing of sensitive data which is accidentally transmitted down the wrong channel (e.g. "whoops I had the wrong window focused when I pasted my account info, better delete that facebook post") or other more opportunistic attacks. Many of these attacks are really inexpensive compared to wide-scale MITM, and can be much harder to detect.

Suspicious Dish posted:

I would say that unauthenticated HTTPS is also dangerous because it allows people like fankey to check off the "uses HTTPS" checkbox and think that they're secure.

This is absolutely a legitimate concern, especially in the context of the web. Web security is garbage because it requires everyone involved to understand what's going on, and encryption without authentication would be an easy trap unless it came with bigger warning bells than we can easily deploy. FWIW, I think this is equally true of self-signed certificates, and I think the two should be treated identically - not advertised as secure in any way, not allowed to be mixed with other content, some way of forcing users to acknowledge that anything they transmit is likely to be compromised.

carry on then
Jul 10, 2010

by VideoGames

(and can't post for 10 years!)

MrMoo posted:

It's simply any big organization: look at IBM's or Microsoft's sites; they're impressively terrible, and they're tech companies.

I can speak from experience and say that you can make two entire web application servers and still be awful at web applications.

Trapick
Apr 17, 2006

Thermopyle posted:

What is the reason so many banks are so seemingly terrible at handling their web sites?

Three reasons. First, banks are run by bankers, who are old and non-technical and don't give a poo poo about web sites. Second, basically no one chooses (or leaves) a bank based on its web site. Third, if you have to go into a branch or call the bank as opposed to doing a thing online, the bank has an opportunity to sell you something.

I work in this field (we do sites and apps for credit unions), so if you have any more specific questions shout at me.

fankey
Aug 31, 2001

Thanks for all of the discussion about my certificate question. My knowledge is now slightly higher than the 'read wikipedia' level I was at before. The devices we sell are pretty much guaranteed to be given a private IP and won't even have a fully qualified name. It looks like there are options for private-IP trusted certs, but given that they need to be deployed on the company's network they don't really work for us since we're the manufacturer.

My leading plan is to enable both HTTP and HTTPS and let the end user install their own certificates if they wish to lock things down.

Bob Morales
Aug 18, 2006


Just wear the fucking mask, Bob

I don't care how many people I probably infected with COVID-19 while refusing to wear a mask, my comfort is far more important than the health and safety of everyone around me!

Thermopyle posted:

What is the reason so many banks are so seemingly terrible at handling their web sites?

Probably because it has to interface with LEGACY_SYSTEM

raminasi
Jan 25, 2005

a last drink with no ice

Bob Morales posted:

Probably because it has to interface with LEGACY_SYSTEM

That's no excuse for, say, a multifactor authentication system that combines the second factor with a separate password such that you have no way of knowing whether you've forgotten the second password or just mistyped the code. And then locks you out after two failed login attempts.

MiniFoo
Dec 25, 2006

METHAMPHETAMINE

I need some help with bash scripting and regular expressions. Disclaimer: I work in IT, not programming.

To avoid the dreaded "XY Problem" and get to the point: we have a client that is moving their data from their current storage to a ReadyNAS combined with Egnyte CFS. In short, I used rsync to successfully transfer all 8.5 TB of data to the ReadyNAS without any hitches, but now there are roughly 58,000 items (files and directories) on it that cannot be uploaded to Egnyte due to illegal characters and naming conventions. Therefore, we need a script that can be run either directly on the ReadyNAS over SSH, or on a Mac with the ReadyNAS drive mapped via SMB, in order to rename those files and folders so they're compatible. My goals:

1. Delete leading and trailing spaces (i.e. /User/ Documents/ and /User/Documents /)
2. Delete hidden control characters (there are a ton of folders with font packs in them containing weird poo poo like that for some reason)
3. Replace the characters listed in the above link ( \ / " : < > | * ? + ) with " - " (space dash space, or at least just a dash)

After reading up a ton on sed and following a bunch of stackexchange posts, I've come up with this:
code:
#!/bin/bash

# Don't glob-expand filenames containing * and friends; split on newlines only.
set -f
IFS=$'\n'

# -depth processes the deepest entries first, so files are renamed
# before the directories that contain them.
for file in $(find . -depth); do
  dir=$(dirname "$file")
  base=$(basename "$file")
  # Work on the basename only so path components stay intact:
  # trim leading/trailing whitespace, strip control characters, and
  # collapse runs of illegal characters into " - ". ('/' can't occur
  # in a basename on Linux, so it isn't in the character class.)
  new_base=$(printf '%s' "$base" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//;s/[[:cntrl:]]//g;s/[\\":<>|*?+][\\":<>|*?+]*/ - /g')
  target_name="$dir/$new_base"
  if [ "$file" != "$target_name" ]; then
    if [ -e "$target_name" ]; then
      echo "WARNING: $target_name already exists, file not renamed"
    else
      echo "Renaming $file to $target_name"
      mv "$file" "$target_name"
    fi
  fi
done
Obviously, the sed command needed to make this work is tricky, and I think I messed up somewhere along the line. Not only that, but I'm not sure at all that the for loop and if tests are set up correctly. Running the script as --verbose, it seems to parse the test directory I ran it on correctly (properly recognizing "/TrailingSpaceFolder " and making $target_name "/TrailingSpaceFolder"), but then it does some weird poo poo while trying to actually rename it, ultimately breaking the script and stopping it.

I'm on a PC and not on my MacBook at the moment, but I'll share some screenshots later on if necessary. Does any of it look blatantly wrong?

Rocko Bonaparte
Mar 12, 2002

Every day is Friday!
Are there still any major websites using CGI for their pages? I am trying to contrast Django with the entirety of CGI, and it looks partisan to have ZERO examples to counterbalance even just Django, let alone MVC frameworks.

Edit: OMG I think Paypal still uses CGI. :stare:

Rocko Bonaparte fucked around with this message at 14:32 on Oct 12, 2016

ShoulderDaemon
Oct 9, 2003
support goon fund
Taco Defender

Rocko Bonaparte posted:

Are there still any major websites using CGI for their pages? I am trying to contrast Django with the entirety of CGI, and it looks partisan to have ZERO examples to counterbalance even just Django, let alone MVC frameworks.

Edit: OMG I think Paypal still uses CGI. :stare:

CGI is just a protocol for a webserver to communicate with a webapp. You can't tell from looking at URLs or headers if CGI is being used. You can't compare it to something like Django, except insofar as Django might be talking to the webserver using CGI under the hood. Nowadays CGI is fairly rare because it's kind of slow, but it's not unheard of.

Back in the day, a lot of webservers used the convention that URLs ending in .cgi corresponded to a single-file webapp of the same name that used the CGI protocol. This convention is completely arbitrary and you should neither assume that URLs looking like that actually use CGI, nor that all CGI webapps must look like that.
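
To make "just a protocol" concrete, here's roughly what a CGI-style webapp reduces to (a sketch; the real CGI spec defines many more environment variables, and `cgi_respond` is an invented name). The server hands the request over as environment variables plus stdin, and the program writes headers, a blank line, and a body to stdout:

```python
def cgi_respond(environ):
    """Build a minimal CGI-style response: headers, a blank line, a body."""
    body = "you asked for: " + environ.get("QUERY_STRING", "")
    return ("Content-Type: text/plain\r\n"
            "\r\n" + body)

# The webserver would set QUERY_STRING (among others) before exec'ing us;
# here we fake it with a plain dict.
print(cgi_respond({"QUERY_STRING": "page=faq"}))
```

Django talking to a webserver, FastCGI, and a 1995 Perl script all fit behind an interface shaped like this; the URL tells you nothing about which one is in use.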

huhu
Feb 24, 2006
I'm looking for some sort of a database/Excel doc/online table of data that gets updated every few seconds. I'm thinking anything like a tracking of recent twitter hashtags or github activity. I'm learning to create dashboards that monitor this kind of stuff and I have no idea how to search for such a thing. I'm hoping that it gets updated every few seconds so I can see visible changes in my dashboard.

Stinky_Pete
Aug 16, 2015

Stinkier than your average bear
Lipstick Apathy

huhu posted:

I'm looking for some sort of a database/Excel doc/online table of data that gets updated every few seconds. I'm thinking anything like a tracking of recent twitter hashtags or github activity. I'm learning to create dashboards that monitor this kind of stuff and I have no idea how to search for such a thing. I'm hoping that it gets updated every few seconds so I can see visible changes in my dashboard.

You're probably gonna want to set up your own database and run a script that queries the Twitter API for trending tags and deposits to it.
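
The shape of that script is just a poll loop. A sketch, with `fetch` and `store` left as stand-ins (in practice they'd be a Twitter API call and a database insert):

```python
import time

def poll(fetch, store, interval=5.0, rounds=None):
    """Call fetch(), hand the result to store(), sleep, repeat.
    rounds=None means run forever (the dashboard case)."""
    n = 0
    while rounds is None or n < rounds:
        store(fetch())
        n += 1
        if rounds is None or n < rounds:
            time.sleep(interval)

# Stubbed-out demo: three polls, no sleeping.
seen = []
poll(fetch=lambda: "#hashtag", store=seen.append, interval=0, rounds=3)
print(seen)     # ['#hashtag', '#hashtag', '#hashtag']
```

The dashboard then reads from the store on its own schedule, decoupled from however often the upstream API lets you poll.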

dougdrums
Feb 25, 2005
CLIENT REQUESTED ELECTRONIC FUNDING RECEIPT (FUNDS NOW)

Rocko Bonaparte posted:

Are there still any major websites using CGI for their pages? I am trying to contrast Django with the entirety of CGI, and it looks partisan to have ZERO examples to counterbalance even just Django, let alone MVC frameworks.

Edit: OMG I think Paypal still uses CGI. :stare:

IIRC CGI used to just straight up run your program with the arguments given over HTTP, thereby creating a process for every request.

I believe most fancy web things now use what is called FastCGI. The program remains resident, and the arguments are sent over a socket instead of spawning a new process each time. FastCGI is simply a protocol offered by your web server of choice.

Rocko Bonaparte
Mar 12, 2002

Every day is Friday!
I know comparing MVC and CGI is apples-to-oranges, so I will further clarify: as opposed to using a framework with the MVC pattern, how many big places are still vomiting all their code between piles of HTML tags?

mystes
May 31, 2006

Rocko Bonaparte posted:

I know comparing MVC and CGI is apples-to-oranges, so I will further clarify: as opposed to using a framework with the MVC pattern, how many big places are still vomiting all their code between piles of HTML tags?
This is still a bit of a weird question. Certainly the trend has been away from embedding business logic within HTML templates, and at the same time the MVC *pattern* (as opposed to the framework named ASP MVC) would hopefully preclude that, but it's certainly possible to write non-MVC code that avoids putting code in HTML templates, even in CGI scripts.

mystes fucked around with this message at 16:27 on Oct 13, 2016

ShoulderDaemon
Oct 9, 2003
support goon fund
Taco Defender

Rocko Bonaparte posted:

I know comparing MVC and CGI is apples-to-oranges, so I will further clarify: as opposed to using a framework with the MVC pattern, how many big places are still vomiting all their code between piles of HTML tags?

I would be surprised if websites which violated the MVC pattern at least a little weren't an outright majority, especially among large businesses. Honestly, even "uses a well-defined framework" feels like a pretty high bar.

dougdrums
Feb 25, 2005
CLIENT REQUESTED ELECTRONIC FUNDING RECEIPT (FUNDS NOW)
I feel like the concept of MVC has been around forever, but the distinction has more use now that webpages are no longer static entities. Once Flash, Ajax, etc. came into play it became less useful to structure things strictly in the manner of HTML templates.

dougdrums fucked around with this message at 18:40 on Oct 13, 2016

Rocko Bonaparte
Mar 12, 2002

Every day is Friday!

mystes posted:

This is still a bit of a weird question. Certainly the trend has been away from embedding business logic within HTML templates, and at the same time the MVC *pattern* (as opposed to the framework named ASP MVC) would hopefully preclude that, but it's certainly possible to write non-MVC code that avoids putting code in HTML templates, even in CGI scripts.

Yes, I'm talking about the pattern, not ASP.NET MVC. So I'm talking about the whole family, like Rails, Django... things that drape themselves in that pattern.

ShoulderDaemon posted:

I would be surprised if websites which violated the MVC pattern at least a little weren't an outright majority, especially among large businesses. Honestly, even "uses a well-defined framework" feels like a pretty high bar.

That's still fine. They actually use something like that versus just shoving everything into static pages, including (possibly raw) database queries. That's my distinction.

mystes
May 31, 2006

Most people use frameworks or microframeworks now, though. It was never really easy to write each page as an HTML template in the first place; it was just a side effect of how PHP and the original ASP worked. A smaller percentage use MVC-oriented frameworks like Rails and Django, so you have to decide if you mean MVC or frameworks in general.

Dr Monkeysee
Oct 11, 2002

just a fox like a hundred thousand others
Nap Ghost
I think a lot of web "applications" these days generally handle all the UI client-side with React or Angular or Knockout, and the thing formerly known as your MVC web site is now a REST web service talking JSON. You can write REST services with MVC frameworks, but if you just need the REST portion there are slimmer frameworks these days like ASP.NET WebApi* or Flask-RESTful.


*I know webapi technically has "controllers" but the framework doesn't use all the same things you expect in an MVC framework like views, templates, viewdata dicts, etc.

LLSix
Jan 20, 2010

The real power behind countless overlords

I need some help picking out a data structure for a new project. I need to be able to build a set of labeled counters from an input of 10,000 (10k) - 100,000 (100k) instructions. Every time an instruction has the same label after the first time, I need to reduce the counter by an amount specified in the instruction and when the counter reaches 0 I should throw away the label and counter. So I need fast insert, fast delete, and fast modify times.

From checking a random selection of the labels I expect to need to be tracking somewhere around 1,000 (1k) at a time before their counters hit 0 and I should discard them. That's just a guess though; in theory there's nothing saying all the instructions aren't just adding new labels to track.

The labels are 12-character strings, but a random sampling shows that whatever is being used to generate them isn't that random. They almost all have the same 4 middle characters, for example. There are also lots of cases of 5 or 6 labels in a row that are exactly the same except for the last character having been incremented, e.g.
qwertyuiopaA
qwertyuiopaB
qwertyuiopaC
qwertyuiopaD

Is a hash table going to be good enough for my needs? I'm worried that the strings won't hash well and I'll tend more towards the worst-case behavior than the average. What alternative data structure would you recommend?

It's been a while since I had to think through efficiency concerns like this and I'm having trouble remembering how to select a data structure.

I'm cross-posting this from the C++ thread. I'll be implementing this in C++ but since most of the question is about data structures in general I thought it might be a good fit here too.

Jabor
Jul 16, 2010

#1 Loser at SpaceChem
If a hash function hashes qwertyuiopaA and qwertyuiopaB to the same value, that's a really loving lovely hash function.

Eela6
May 25, 2007
Shredded Hen
Hash tables (à la Python's dictionaries) are indeed what you want here. A proper hash shouldn't have any problems with the keys being close values. See 'Universal Hashing', section 11.3.3, in CLRS's 'Introduction to Algorithms', 3rd ed.
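
A minimal dict sketch (I'm assuming the first instruction for a label establishes its starting count, since the post doesn't say; `apply_instruction` is an invented name):

```python
counters = {}

def apply_instruction(label, amount):
    """First sighting of a label sets its counter; later sightings
    decrement it, and the entry is dropped once it reaches zero.
    Insert, lookup, and delete are all average-case O(1) dict
    (hash table) operations, so 100k instructions are cheap."""
    if label not in counters:
        counters[label] = amount
    else:
        counters[label] -= amount
        if counters[label] <= 0:
            del counters[label]

apply_instruction("qwertyuiopaA", 5)
apply_instruction("qwertyuiopaB", 5)
apply_instruction("qwertyuiopaA", 3)   # A down to 2
apply_instruction("qwertyuiopaA", 2)   # A hits 0 and is discarded
print(sorted(counters))                # ['qwertyuiopaB']
```

With ~1k live labels of 12 bytes each, the whole table fits comfortably in cache, so there's no reason to reach for anything fancier unless profiling says otherwise.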


Skandranon
Sep 6, 2008
fucking stupid, dont listen to me
Even with a mediocre hash function, a hashtable is going to be an order of magnitude or more faster than scanning a 100k-element array for every instruction.
