|
Skandranon posted:Even the worst case behavior for a hashtable is going to be an order of magnitude or more faster than iterating over a 100k-element array. Hashtables are amortized O(1) for access; I remember that much even if I can't remember exactly how that works.
|
# ? Oct 15, 2016 15:35 |
|
|
# ? May 15, 2024 18:55 |
|
TooMuchAbstraction posted:Hashtables are amortized O(1) for access; I remember that much even if I can't remember exactly how that works. The hashing function takes properties of the item to be stored and transforms it into an integer that can be mapped to the indices of the array backing the hashtable. So basically, the data in the object tells you what index it should be stored in, and the table implementation will have a method of resolving conflicts if two different objects hash to the same value. So if your hashing function is O(1), then your retrieval is O(1) since arrays are constant time random access. That's the simplified explanation, anyway.
|
# ? Oct 15, 2016 17:39 |
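The index mapping being described takes only a couple of lines to see in Python (a toy sketch of the idea, not how any real implementation lays things out):

```python
# An object's hash is an integer derived from its data; taking it modulo
# the backing array's capacity yields a valid slot index in that array.
capacity = 8
key = "user:42"
index = hash(key) % capacity
assert 0 <= index < capacity

# Equal objects hash equally, so a later lookup recomputes the same slot.
assert hash("user:42") % capacity == index
```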
|
carry on then posted:The hashing function takes properties of the item to be stored and transforms it into an integer that can be mapped to the indices of the array backing the hashtable. So basically, the data in the object tells you what index it should be stored in, and the table implementation will have a method of resolving conflicts if two different objects hash to the same value. So if your hashing function is O(1), then your retrieval is O(1) since arrays are constant time random access. That much is obvious; what's not obvious is why it's amortized O(1) instead of just O(1) as it would be if your explanation was all there was to it. There's something in there about hash function requirements and/or what happens when there's a hash collision.
|
# ? Oct 15, 2016 18:18 |
|
TooMuchAbstraction posted:That much is obvious; what's not obvious is why it's amortized O(1) instead of just O(1) as it would be if your explanation was all there was to it. There's something in there about hash function requirements and/or what happens when there's a hash collision. Because if the table were right up to its load factor threshold with collisions, then you'd spend some proportion of n either bouncing around under open addressing or searching the linked structure under separate chaining to find the object you're looking for, probably. But that case is rare enough not to really affect the average case runtime, is what I gather the amortized qualifier means.
|
# ? Oct 15, 2016 18:37 |
|
The amortized qualifier most accurately applies to hash table insertion, where in the typical case the operation is constant time, but occasionally the hash table is resized, which is O(N). However, you've necessarily done N inserts at that point, so the cost is "amortized" over the previous inserts, such that in aggregate, the total time spent inserting is a constant factor of the number of inserts. The fact that you may have a poor hash function or otherwise have poor distribution of items across buckets, resulting in non-constant lookup time, is more accounted for in the worst case analysis of reads, not so much that it is amortized constant time.
|
# ? Oct 15, 2016 18:58 |
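The accounting argument above can be demonstrated numerically. A rough sketch (assuming doubling growth and one unit of work per element moved during a resize — a simplification of any real table):

```python
def total_insert_cost(n, initial_capacity=8):
    """Count units of work for n inserts into a table that doubles when
    full: 1 unit per insert, plus 1 unit per element rehashed on resize."""
    cost = 0
    size = 0
    capacity = initial_capacity
    for _ in range(n):
        if size == capacity:   # table full: resize and rehash everything
            cost += size       # moving `size` elements is O(size)
            capacity *= 2
        cost += 1              # the insert itself
        size += 1
    return cost

# Total work stays within a constant factor of n: each occasional O(n)
# resize is "paid for" by the n cheap inserts that preceded it.
for n in (100, 10_000, 1_000_000):
    assert total_insert_cost(n) < 3 * n
```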
|
So lookup for hashtables is not amortized O(1), just unqualified O(1)?
|
# ? Oct 15, 2016 19:19 |
|
No, it's just not clear to me that amortized is the correct qualifier for lookup. There are numerous ways in which different hash table implementations can fall short of constant time lookup; hash collisions are one such reason. Average case or best case are likely correct qualifiers for constant time lookup for most implementations.

e: put differently, while both insert and lookup operations can be O(N), in the amortized constant time insert case, there is a limit to the number of times that linear time operation occurs. However, in the worst case lookup (like, say, all elements in the hash table collide in the same bucket and the hash table uses simple linked-list unsorted chaining), *every* lookup will be O(N).

Steve French fucked around with this message at 19:29 on Oct 15, 2016 |
# ? Oct 15, 2016 19:24 |
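That degenerate case is easy to make concrete with a toy separate-chaining table and a deliberately broken hash function (a hypothetical sketch, counting key comparisons to make the lookup cost visible):

```python
class ChainedTable:
    """Minimal separate-chaining hash table that counts key comparisons."""
    def __init__(self, capacity=256, hash_fn=hash):
        self.buckets = [[] for _ in range(capacity)]
        self.hash_fn = hash_fn
        self.comparisons = 0

    def _bucket(self, key):
        return self.buckets[self.hash_fn(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            self.comparisons += 1
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self._bucket(key):
            self.comparisons += 1
            if k == key:
                return v
        raise KeyError(key)

# Pathological case: every key hashes to the same bucket.
bad = ChainedTable(hash_fn=lambda k: 0)
for i in range(100):
    bad.put(i, i * i)
bad.comparisons = 0
bad.get(99)                    # last key inserted sits at the chain's end
assert bad.comparisons == 100  # the whole chain is scanned: O(N)

# Well-distributed case: chains stay short, lookup stays ~O(1).
good = ChainedTable()
for i in range(100):
    good.put(i, i * i)
good.comparisons = 0
good.get(99)
assert good.comparisons == 1
```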
|
Aha, thanks for the explanation! That jibes with my dim memory of the Algorithms course I took ~12 years ago.
|
# ? Oct 15, 2016 20:15 |
|
What's the best way to handle a web application that would mostly be handling data with a fixed schema but needs to allow the user to associate custom fields with their own items? The fixed part of the database is set up with Postgres right now, but it's still in the prototype stage so it wouldn't be too much of a problem to change that now. After a bit of research, I think that these are my main options, as ordered by how much work it puts on me.
It wouldn't be too hard to just write things and see which one makes me cry blood the least, but I don't exactly know much about this end of things and I'm probably missing something obvious and easy(er).
|
# ? Oct 15, 2016 21:01 |
|
I didn't really put any thought into your post or really read it because I'm a bad poster, but I noticed you mentioning using a NoSQL database in addition to Postgres. Is there a reason you couldn't just use Postgres' own JSON field types?
|
# ? Oct 15, 2016 22:23 |
|
If you don't need to query the custom fields, shove it all into a JSON column. If you do need to query them, you will need to write some SQL (if you can call it that), or find a library. What technology are you using?
|
# ? Oct 15, 2016 22:36 |
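To make the "query the JSON column" part concrete: Postgres has `->>`/`@>` operators for its `json`/`jsonb` types, but the same idea can be demonstrated with SQLite's JSON1 functions from Python's standard library (hedged: the syntax differs from Postgres, the table and fields here are made up, and this assumes your bundled SQLite includes the JSON1 extension, which modern Python builds do):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, custom TEXT)")
conn.executemany(
    "INSERT INTO items (custom) VALUES (?)",
    [('{"color": "red", "weight": 3}',),
     ('{"color": "blue"}',)],
)

# json_extract pulls a field out of the JSON blob; the rough Postgres
# equivalent would be  custom->>'color' = 'red'  on a jsonb column.
rows = conn.execute(
    "SELECT id FROM items WHERE json_extract(custom, '$.color') = ?",
    ("red",),
).fetchall()
assert rows == [(1,)]
```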
|
There's nothing wrong with an EAV (entity-attribute-value - your option 1) schema, and they're not terribly inefficient if indexed properly. It might be a bit tricky to actually query, depending on your ORM or whatever. Some people consider EAV tables to be a code smell or even an outright antipattern, usually because people often go overboard and jam all their data into EAVs (at which point you've just designed a lovely key-value store). I'm not familiar with Postgres' hstore or json features, but I'd definitely look into those. It's probably overkill to involve a separate nosql data store.
|
# ? Oct 16, 2016 00:38 |
|
Thermopyle posted:I didn't really put any thought into your post or really read it because I'm a bad poster, but I noticed you mentioning using a NoSQL database in addition to Postgres. They're there, but the database library that I'm using... actually does have what looks like JSON support, it's just not documented anywhere explicitly, but there is a PGJsonb column type and what looks like the ability to query the data inside? So scratch 2 and 5 off the list, since this either works or is close. JSONB seems pretty nice. EAV isn't a huge problem since I think querying it won't be too bad (or at least I have the tools to do a good job of hiding the badness if my mental sketch is accurate), it was mostly the NEVER DO IT crowd that was making me think it was an awful idea. I even have a sketch of how I could at least make it easy to enforce consistency between rows that should be the same. I'll need to query them, unfortunately, so I'll have to write the SQL myself and then have low-grade panic attacks about it not being good enough constantly.
|
# ? Oct 16, 2016 00:54 |
What's the name of the algorithm that selects a fixed region of a given minimum and maximum area, then places a point in it, repeating the process until the region is almost filled? It gives a nice not-quite-uniform but not-quite-random packing.
|
|
# ? Oct 16, 2016 03:43 |
|
Jo posted:What's the name of the algorithm that selects a fixed region of a given minimum and maximum area, then places a point in it, repeating the process until the region is almost filled? It gives a nice not-quite-uniform but not-quite-random packing. There are a few things that do what I think you're after.

You can subdivide your region into an array of buckets of fixed equal size and then uniformly sample a point in each bucket, which is called jittered or stratified sampling. There are various pseudo-random sequences (Hammersley, Halton, etc.) that are designed such that taking the first N elements of the sequence results in an even-ish distribution of points. There's maximal Poisson disk sampling, where you try to find a new point that's sufficiently far away from your old points; this is often referred to as dart throwing, from the first implementing algorithm, which uniformly selected candidate samples and would accept or reject them depending on distance to the closest existing accepted sample.

There are a lot of variations. As a category, the entire "not uniform but not random" sampling thing is generally called blue noise sampling or low-discrepancy sampling.
|
# ? Oct 16, 2016 05:31 |
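Of those, jittered/stratified sampling is the simplest to sketch — a minimal toy over the unit square (hedged: illustration only, grid size and names are made up):

```python
import random

def jittered_samples(cells_x, cells_y, rng=None):
    """Stratified ("jittered") sampling over the unit square: one uniform
    sample inside each cell of a cells_x-by-cells_y grid. The result is
    evenly spread but not grid-regular."""
    rng = rng or random.Random()
    points = []
    for i in range(cells_x):
        for j in range(cells_y):
            points.append(((i + rng.random()) / cells_x,
                           (j + rng.random()) / cells_y))
    return points

pts = jittered_samples(8, 8, random.Random(0))
assert len(pts) == 64
# By construction, each grid cell holds exactly one point.
cells = {(min(int(x * 8), 7), min(int(y * 8), 7)) for x, y in pts}
assert len(cells) == 64
```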
|
Jo posted:What's the name of the algorithm that selects a fixed region of a given minimum and maximum area, then places a point in it, repeating the process until the region is almost filled? It gives a nice not-quite-uniform but not-quite-random packing. https://en.wikipedia.org/wiki/Voronoi_diagram ?
|
# ? Oct 17, 2016 15:02 |
Xerophyte posted:jittered or stratified sampling This has me on the right track, I think. Much obliged.
|
|
# ? Oct 18, 2016 00:24 |
|
Is my lack of a formal education leaving me high and dry, or am I correct in feeling like this is...not a reasonable way to construct (and resolve) a class hierarchy? code:
|
# ? Oct 18, 2016 00:33 |
|
That shit doesn't make any sense at all. Couldn't Mom and Dad implement IParent? Why would you switch on the type of Mom and Dad? IParent should have a changeDiaper() method or something that changes a diaper, but the details should be hidden from whoever is using it. You shouldn't need to know the specific type of an object, only the contracts it claims to fulfill.
|
# ? Oct 18, 2016 01:30 |
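The shape of that fix — behavior behind the interface instead of type switches — looks roughly like this in Python (a hypothetical illustration using the names from the discussion, not the actual code being reviewed):

```python
from abc import ABC, abstractmethod

class Parent(ABC):
    """Callers depend on this contract, never on concrete subtypes."""
    @abstractmethod
    def change_diaper(self) -> str: ...

class Mom(Parent):
    def change_diaper(self) -> str:
        return "mom changed the diaper"

class Dad(Parent):
    def change_diaper(self) -> str:
        return "dad changed the diaper"

def handle(parent: Parent) -> str:
    # No isinstance/enum switch: dynamic dispatch picks the right
    # implementation, and new Parent subtypes need no changes here.
    return parent.change_diaper()

assert handle(Mom()) == "mom changed the diaper"
assert handle(Dad()) == "dad changed the diaper"
```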
|
Cuntpunch posted:But this comes from some rather senior people that I don't know well enough yet to doubt. Seniority does not imply competence.
|
# ? Oct 18, 2016 01:36 |
|
If it didn't have the IMom and IDad and was just switching on the specific type of the IGrandad, it would be fine as a kind of algebraic data type, but I have no idea what those interfaces are doing. Though if all it's really doing is calling a method on the object no matter what, there's no reason for that kind of thing.
|
# ? Oct 18, 2016 01:56 |
|
What feels bonkers to me is the use of an upper level interface to do little more than ensure there's an enum to flag which child-type a thing is? And that's a simplified example, in practice we've got this sort of 'intermediary interface simply defines an enum for a lower level type for casting purposes' up to 3 layers deep in some places.
|
# ? Oct 18, 2016 13:49 |
|
Yeah, nothing is going to inherit from Mom or Dad, so you might as well seal them and use Type.IsAssignableFrom() to do your type matching, instead of having a big circular mess. It's not the worst thing ever, but it's a bit overkill based on what was posted. I mean generally you'd make an IDad and then have class Person extend IDad to make a Dad, because it's also likely that Mom and Dad share the same members, but it's just that one of them needs to do some IDad stuff now. I guess another approach would be to hit the enums with [Flags] and get rid of the unnecessary hierarchy.

dougdrums fucked around with this message at 15:04 on Oct 18, 2016 |
# ? Oct 18, 2016 14:56 |
|
I'm writing a small Ruby on Rails* app to automate the sending of text messages on a user-configurable schedule. I have a scheduled_texts table, and I'll probably just have a flag on each text to indicate if it has already been sent, still needs to be sent, or generated some error when being sent. I don't anticipate any problems with this approach.

However, if this app were being used on a larger scale, having the table of texts to be sent cluttered up with every text that has ever been sent could get ugly. At my first programming job, many years ago, we had archive tables for this sort of thing, but it just seems wrong to have multiple tables with the same schema, even for archival purposes. It would also be a headache to maintain both tables whenever there's a change in the schema, and a potential problem when a change to the current table could invalidate data in the archive table.

Is there a better, more modern best practice for this? I've Googled a bit, but have only found descriptions of ETL solutions (which are well beyond the complexity I need for this, and which don't really solve the issue on a fundamental level) or multi-server indexing or something that I don't even understand.

* I didn't post in the RoR thread because this is a more general database design question - the development platform is pretty irrelevant here.
|
# ? Oct 18, 2016 16:25 |
|
Seems like you could have a partition on the sent_timestamp or whatever. Depending on your database system and how you set it up, you now have something close to two physical tables that are logically the same when you query against it.
|
# ? Oct 18, 2016 17:04 |
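For concreteness, here's roughly what partitioning on the schedule timestamp looks like in Postgres 10+ declarative syntax (hypothetical table and column names; in 2016-era Postgres you'd build the same thing out of table inheritance plus triggers or rules):

```sql
-- One logical table, physically split by month on scheduled_for.
CREATE TABLE scheduled_texts (
    id            bigserial,
    body          text NOT NULL,
    status        text NOT NULL,        -- pending / sent / error
    scheduled_for timestamptz NOT NULL  -- the partition key
) PARTITION BY RANGE (scheduled_for);

CREATE TABLE scheduled_texts_2016_10 PARTITION OF scheduled_texts
    FOR VALUES FROM ('2016-10-01') TO ('2016-11-01');

-- Queries still hit the one logical table; the planner prunes
-- partitions whose range can't match the predicate.
SELECT * FROM scheduled_texts
WHERE scheduled_for >= '2016-10-15' AND scheduled_for < '2016-10-16';
```

Old partitions can then be detached or dropped wholesale instead of deleting rows one by one.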
|
General question to see if something is feasible: I build HTML emails. I get copy delivered in Word documents, which I transfer into HTML. We are a fairly large organization and content comes to me from many people.

Some people copy content for their emails from other emails we've previously sent out. As a result, links in the Word documents are sometimes copied as the tracking URLs used by our email service provider. This is not good for my workflow.

Here's an example of how a tracking URL looks: http://subdomain.domain.com/ct/90809005:Ts44xFYVN:m:1:1612444031:6B9F5AA951851A1C84240C42C18EFEB7:r

The only consistent elements are http://subdomain.domain.com/ct/ and the final letter at the very end. The alphanumeric strings are unique for each link. When clicked, that tracking URL will go to the actual destination URL. I need those final destination URLs for when I'm building the emails out in HTML and me having to manually open each one is tedious and opens room for errors.

So... I'm trying to find out if there might be a way to automatically replace the tracking URLs with the actual target URLs. Any general direction of a resource would be appreciated. I'm happy to dig into a solution if I can get started down the right path. Any ideas?
|
# ? Oct 18, 2016 18:53 |
If you need to open everything in Word anyway, to get the text out, the best solution workflow-wise may be a VBA macro to sanitize the links. However I'm not sure how reasonable it is to do web requests from VBA. The basic process would be to find all links, check for the pattern you describe, and then for each of them simply emulate a browser requesting the URL and see what comes back. Most tracking services would just return a basic redirect to the actual target, so you'd be done in a single request per unique URL. It's definitely something that can be automated, the question is really just at what stage you want to do it.
|
|
# ? Oct 18, 2016 19:10 |
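The same request-and-follow-the-redirect idea sketched outside of VBA, in Python for illustration (hedged: the regex is built from the pattern described in the post with placeholder host names, and resolve() assumes the tracker answers with an ordinary 3xx redirect, which urllib follows automatically — untested against any real tracker):

```python
import re
import urllib.request

# Tracking-link shape from the post: the fixed /ct/ prefix, then
# colon-separated alphanumeric segments ending in a single letter.
# (subdomain.domain.com is a placeholder; adjust the host part.)
TRACKING_RE = re.compile(r"^https?://[\w.-]+/ct/[\w:]+:[A-Za-z]$")

def is_tracking_url(url: str) -> bool:
    return bool(TRACKING_RE.match(url))

def resolve(url: str) -> str:
    """Follow the tracker's redirect and return the final URL."""
    with urllib.request.urlopen(url) as resp:
        return resp.geturl()

assert is_tracking_url(
    "http://subdomain.domain.com/ct/90809005:Ts44xFYVN:m:1:1612444031:"
    "6B9F5AA951851A1C84240C42C18EFEB7:r"
)
assert not is_tracking_url("http://www.example.com/about")
```

From there it's a find-and-replace over the document's hyperlinks, calling resolve() only for links that match.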
|
I have a DataTable with the following columns: TableFieldID, EntityType, TableName, FieldName, Caption. TableName contains 3-letter abbreviations such as "ADT", "ADN", "ASS", "PAT" and a few others. Caption contains "user friendly" values for various columns associated with the corresponding Table. Example being "Audit Create Date". FieldName contains a string using the following "pattern": TableFieldID@EntityType#TableName.Caption except we remove any spaces from the Caption. So one of the values might be "49555@13#ADT.AuditCreateDate". I need to remove all Audit-related items depending on the type of user that is logged in. I was able to simply do something like this: code:
Then I just remove the returned collection from myDataTable. This worked, until I realized there are columns which contain the string "audit", but aren't actually part of the set that should be removed, such as "Auditory Comprehension" Any ideas on how I can deal with this? I can't add a space to my query -- '%Audit %' -- because then this will miss columns such as "Audited On". There are over 1400 records in this result set, I don't really want to manually comb through this looking for exceptions to the rule but I think I might not have any options...
|
# ? Oct 18, 2016 19:23 |
|
You're probably screwed. There are types of fulltext indexes which can filter on stems - pass it "Audit" and you'd get "Audited" and "Auditing" but not "Auditory" - but unless your database supports such an index and you can apply it then you're stuck looking through the records by hand. What is EntityType? Can you use that in your predicate somehow?
|
# ? Oct 18, 2016 19:34 |
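If the rows have to be filtered client-side anyway, a crude stem match in code gets further than SQL LIKE can. A hypothetical sketch in Python against the captions described above (the suffix list is something you'd have to curate by hand, and "Audit" at a word or camel-case boundary is the assumed rule):

```python
import re

# Match "Audit" as a stem: optionally followed by a known suffix, then a
# boundary (end of string, or any non-lowercase character: space, dot,
# or a camel-case uppercase letter). Catches "Audit Create Date",
# "AuditCreateDate", and "Audited On", but skips "Auditory".
AUDIT_RE = re.compile(r"Audit(ed|ing|s)?(?=$|[^a-z])")

def is_audit_field(field_name: str) -> bool:
    return bool(AUDIT_RE.search(field_name))

assert is_audit_field("49555@13#ADT.AuditCreateDate")
assert is_audit_field("Audited On")
assert is_audit_field("Audit Create Date")
assert not is_audit_field("Auditory Comprehension")
assert not is_audit_field("AuditoryComprehension")
```

It still fails on any "Audit..." word outside the suffix list, so it's a narrower net rather than a complete fix.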
|
Honestly I'm not sure what EntityType is. Fired off an email to the DBA. But yeah, methinks you're right. Most likely SOL.
|
# ? Oct 18, 2016 19:42 |
|
nielsm posted:If you need to open everything in Word anyway, to get the text out, the best solution workflow-wise may be a VBA macro to sanitize the links. However I'm not sure how reasonable it is to do web requests from VBA. Thanks. I'll look into the VBA solution!
|
# ? Oct 18, 2016 20:13 |
|
nielsm posted:If you need to open everything in Word anyway, to get the text out, the best solution workflow-wise may be a VBA macro to sanitize the links. However I'm not sure how reasonable it is to do web requests from VBA.
|
# ? Oct 19, 2016 03:22 |
|
Peristalsis posted:I'm writing a small Ruby on Rails* app to automate the sending of text messages on a user-configurable schedule. Do you need to keep them around after they've been sent? If no, can you just delete them afterwards?
|
# ? Oct 19, 2016 05:26 |
|
Can someone explain to me why Python uses a mutex (the GIL) when you're green threading? Like, the best I can guess is that maybe CPython has a risk of creating race conditions due to threads of CPython running faster than Python code in the process? Also, is Python itself already green threaded? It looks like, when you make a user thread in Python, you're actually threading into Python's underlying threading. I don't know, the shit confuses me. It's some next-level inception bs. All I know is that Python seems to get increasingly slower the more I use threading.
|
# ? Oct 19, 2016 11:06 |
|
Because correct and performant multithreading is hard to do right and easy to fuck up, and just having a single lock for the interpreter is the best compromise for basically all the things you're supposed to use Python threads for anyway. Python isn't suitable for doing heavy compute tasks directly. You're supposed to call out to non-Python library code if you want to do that sort of heavy lifting.
|
# ? Oct 19, 2016 11:21 |
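One concrete consequence worth knowing: the GIL makes individual bytecode instructions atomic, but compound Python operations like `+=` are several bytecodes, so shared state still needs your own locks — the interpreter's lock is not a substitute. A small demonstration:

```python
import threading

def increment(counter, lock, n):
    for _ in range(n):
        with lock:           # guard the read-modify-write
            counter[0] += 1  # += is load, add, store: not atomic

counter = [0]
lock = threading.Lock()
threads = [threading.Thread(target=increment, args=(counter, lock, 50_000))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With the lock the count is exact; without it, updates can be lost
# even though the GIL serializes bytecode execution.
assert counter[0] == 200_000
```

And because all four threads contend for the GIL, CPU-bound code like this runs no faster with more threads — which is also why Python feels slower the more threading you throw at compute work.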
|
Jabor posted:Because correct and performant multithreading is hard to do right and easy to fuck up, and just having a single lock for the interpreter is the best compromise for basically all the things you're supposed to use Python threads for anyway. Thanks. I was aware that Python was typically used more as glue between non-Python code but didn't really appreciate why. That's fair reasoning.
|
# ? Oct 19, 2016 11:35 |
|
Steve French posted:Do you need to keep them around after they've been sent? If no, can you just delete them afterwards? I'm not sure if I'm required to keep them, but I want to do so at least to cover my own ass (i.e. I don't want someone coming to me whining that some text didn't get sent, and not be able to look it up and see what's going on).
|
# ? Oct 19, 2016 18:28 |
|
Peristalsis posted:I'm not sure if I'm required to keep them, but I want to do so at least to cover my own ass (i.e. I don't want someone coming to me whining that some text didn't get sent, and not be able to look it up and see what's going on). That would be a good use for logging
|
# ? Oct 19, 2016 18:41 |
|
Steve French posted:That would be a good use for logging I suppose that's true, in this case. I'm curious about my original question, though - is there a best practice these days for tables that legitimately have too many rows? I'm not really sure what partitioning on a column means (though I can guess, or even look it up), but it sounds like there isn't just a single pattern that everyone uses and that takes care of the problem.
|
# ? Oct 19, 2016 20:04 |
|
|
|
There's a fairly narrow window where "too many rows" is actually a thing but a relational database is still the appropriate storage. Most database servers these days support sharding a table across multiple servers in a cluster.
|
# ? Oct 20, 2016 01:53 |