TooMuchAbstraction
Oct 14, 2012

I spent four years making
Waves of Steel
Hell yes I'm going to turn my avatar into an ad for it.
Fun Shoe

Skandranon posted:

Even the worst-case behavior for a hashtable is going to be an order of magnitude or more faster than iterating over a 100k array.

Hashtables are amortized O(1) for access; I remember that much even if I can't remember exactly how that works. :v:

carry on then
Jul 10, 2010

by VideoGames

(and can't post for 10 years!)

TooMuchAbstraction posted:

Hashtables are amortized O(1) for access; I remember that much even if I can't remember exactly how that works. :v:

The hashing function takes properties of the item to be stored and transforms them into an integer that can be mapped to the indices of the array backing the hashtable. So basically, the data in the object tells you what index it should be stored at, and the table implementation will have a method of resolving conflicts if two different objects hash to the same value. So if your hashing function is O(1), then your retrieval is O(1), since arrays are constant-time random access.

That's the simplified explanation, anyway.
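
For concreteness, here is a toy Python version of the scheme just described, using separate chaining; real implementations differ, but the O(1)-if-the-hash-spreads-keys-out argument is visible in it (all names invented for illustration):

code:

# Hash -> bucket index, with separate chaining for collisions.
# If the hash spreads keys evenly, chains stay short and get/put are O(1).
class TinyHashTable:
    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)  # O(1): hash, then array index

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:              # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))   # collision resolution: append to chain

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

t = TinyHashTable()
t.put("a", 1)
print(t.get("a"))  # 1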

TooMuchAbstraction
Oct 14, 2012

I spent four years making
Waves of Steel
Hell yes I'm going to turn my avatar into an ad for it.
Fun Shoe

carry on then posted:

The hashing function takes properties of the item to be stored and transforms them into an integer that can be mapped to the indices of the array backing the hashtable. So basically, the data in the object tells you what index it should be stored at, and the table implementation will have a method of resolving conflicts if two different objects hash to the same value. So if your hashing function is O(1), then your retrieval is O(1), since arrays are constant-time random access.

That's the simplified explanation, anyway.

That much is obvious; what's not obvious is why it's amortized O(1) instead of just O(1) as it would be if your explanation was all there was to it. There's something in there about hash function requirements and/or what happens when there's a hash collision.

carry on then
Jul 10, 2010

by VideoGames

(and can't post for 10 years!)

TooMuchAbstraction posted:

That much is obvious; what's not obvious is why it's amortized O(1) instead of just O(1) as it would be if your explanation was all there was to it. There's something in there about hash function requirements and/or what happens when there's a hash collision.

Because if the table were right up to its load factor threshold with collisions, then you'd spend some proportion of n either bouncing around under open addressing or searching the linked structure under separate chaining to find the object you're looking for, probably. But that case is rare enough not to really affect the average case runtime, is what I gather the amortized qualifier means.

Steve French
Sep 8, 2003

The amortized qualifier most accurately applies to hash table insertion, where in the typical case the operation is constant time, but occasionally the hash table is resized, which is O(N). However, you've necessarily done N inserts at that point, so the cost is "amortized" over the previous inserts, such that in aggregate, the total time spent inserting is a constant factor of the number of inserts.

The fact that you may have a poor hash function or otherwise have poor distribution of items across buckets, resulting in non-constant lookup time, is more accounted for in the worst case analysis of reads, not so much that it is amortized constant time.
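
The arithmetic behind that amortization, as a quick sketch: with capacity doubling, the copying work over N inserts totals 1 + 2 + 4 + ... < 2N, so the average cost per insert stays constant (function name invented):

code:

# Total elements copied during resizes, assuming capacity doubles when full.
def total_resize_work(n_inserts, initial_capacity=1):
    work, cap = 0, initial_capacity
    while cap < n_inserts:
        work += cap   # a resize copies every element currently stored
        cap *= 2
    return work

for n in (10, 1_000, 100_000):
    print(n, total_resize_work(n) / n)  # per-insert overhead stays below 2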

carry on then
Jul 10, 2010

by VideoGames

(and can't post for 10 years!)

So lookup for hashtables is not amortized O(1), just unqualified O(1)?

Steve French
Sep 8, 2003

No, it's just not clear to me that amortized is the correct qualifier for lookup. There are numerous ways in which different hash table implementations can fall short of constant time lookup; hash collisions is one such reason. Average case or best case are likely correct qualifiers for constant time lookup for most implementations.

e: put differently, while both insert and lookup operations can be O(N), in the amortized constant time insert case, there is a limit to the number of times that linear time operation occurs. However, in the worst case lookup (like, say, all elements in the hash table collide in the same bucket and the hash table uses simple linked-list unsorted chaining), *every* lookup will be O(N).

Steve French fucked around with this message at 19:29 on Oct 15, 2016
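
To make that worst case concrete, a small Python sketch: a key type with a constant hash forces every entry to collide, so each lookup degrades to a linear scan (CPython uses open addressing rather than chaining, but the effect is the same):

code:

# Every BadKey hashes to the same value, so the table can't spread them out.
class BadKey:
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 42                    # pathological: all keys collide
    def __eq__(self, other):
        return self.n == other.n

d = {BadKey(i): i for i in range(1_000)}
print(d[BadKey(999)])  # still correct, but every lookup is O(N)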

TooMuchAbstraction
Oct 14, 2012

I spent four years making
Waves of Steel
Hell yes I'm going to turn my avatar into an ad for it.
Fun Shoe
Aha, thanks for the explanation! That jibes with my dim memory of the Algorithms course I took ~12 years ago.

Sinestro
Oct 31, 2010

The perfect day needs the perfect set of wheels.
What's the best way to handle a web application that would mostly be handling data with a fixed schema but needs to allow the user to associate custom fields with their own items? The fixed part of the database is set up with Postgres right now, but it's still in the prototype stage so it wouldn't be too much of a problem to change that now.

After a bit of research, I think that these are my main options, as ordered by how much work it puts on me.
  1. Simply add a table for each extensible type that provides simple key-value storage associated with a row of the table. This seems laughably inefficient, unfortunately.
  2. Use a NoSQL database in addition to Postgres, storing a reference to a document in each row that would contain the annotations. Unfortunately, the only one with actively maintained bindings is MongoDB.
  3. Dynamically updated tables for each user with custom data. Would require simple-ish custom SQL that I know how to write, and migrations aren't too poorly understood, but I don't know much about that topic and it'd be exposing a fairly tricky tool to users. It's aimed at IT personnel, but I don't know how much that'd help. This is the only one that has an obvious way to enforce any degree of consistency within a user's dataset at the datastore level.
  4. Leverage Postgres' built-in key/value options like HStore or JSON fields. I'm guessing that from a purely technical perspective this is the best option, but there's not really any support for either in the libraries I'm using, so I'd either have to do significant work to add it to them or do a lot of stuff with raw SQL, which I'd rather never do.
  5. Implement everything on top of a NoSQL database. Unfortunately, the app's functionality is pretty much just "store your data here and then use my nice front end to look through the relations", so it's not very well-suited to non-relational storage.

It'd not be too hard to just write things and see which one makes me cry blood the least, but I don't exactly know much about this end of things and I'm probably missing something obvious and easy(er).

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

I didn't really put any thought into your post or really read it because I'm a bad poster, but I noticed you mentioned using a NoSQL database in addition to Postgres.

Is there a reason you couldn't just use Postgres' own JSON field types?

Sedro
Dec 31, 2008
If you don't need to query the custom fields, shove it all into a JSON column.

If you do need to query them, you will need to write some SQL (if you can call it that), or find a library. What technology are you using?

redleader
Aug 18, 2005

Engage according to operational parameters
There's nothing wrong with an EAV (entity-attribute-value - your option 1) schema, and they're not terribly inefficient if indexed properly. It might be a bit tricky to actually query, depending on your ORM or whatever. Some people consider EAV tables to be a code smell or even an outright antipattern, usually because people often go overboard and jam all their data into EAVs (at which point you've just designed a lovely key-value store).

I'm not familiar with Postgres' hstore or json features, but I'd definitely look into those. It's probably overkill to involve a separate nosql data store.
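
For a concrete picture of option 1, here is a minimal EAV sketch. It uses Python's stdlib sqlite3 purely to stay self-contained, and every table and column name is invented; the shape is the same in Postgres:

code:

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE item_attrs (
        item_id INTEGER REFERENCES items(id),
        key     TEXT NOT NULL,
        value   TEXT,
        PRIMARY KEY (item_id, key)  -- doubles as the index redleader mentions
    );
""")
db.execute("INSERT INTO items VALUES (1, 'switch-01')")
db.execute("INSERT INTO item_attrs VALUES (1, 'rack', '12')")

# "All items where rack = 12": one join per attribute you filter on.
rows = db.execute("""
    SELECT i.name FROM items i
    JOIN item_attrs a ON a.item_id = i.id
    WHERE a.key = 'rack' AND a.value = '12'
""").fetchall()
print(rows)  # [('switch-01',)]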

Sinestro
Oct 31, 2010

The perfect day needs the perfect set of wheels.

Thermopyle posted:

I didn't really put any thought into your post or really read it because I'm a bad poster, but I noticed you mentioned using a NoSQL database in addition to Postgres.

Is there a reason you couldn't just use Postgres' own JSON field types?

They're there, but the database library that I'm using... actually does have what looks like JSON support, it's just not documented anywhere explicitly, but there is a PGJsonb column type and what looks like the ability to query the data inside? So scratch 2 and 5 off the list, since this either works or is close. JSONB seems pretty nice.

EAV isn't a huge problem, since I think querying it won't be too bad (or at least I have the tools to do a good job of hiding the badness if my mental sketch is accurate); it was mostly the :siren: NEVER DO IT :siren: crowd that was making me think it was an awful idea. I even have a sketch of how I could at least make it easy to enforce consistency between rows that should be the same. I'll need to query them, unfortunately, so I'll have to write the SQL myself and then have low-grade panic attacks about it not being good enough constantly.
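
For reference, querying a JSONB column looks roughly like this. The sketch assumes psycopg2 as the driver and invents the items table and custom column; ->> and @> are standard Postgres JSONB operators, and @> can use a GIN index:

code:

import psycopg2

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    # ->> extracts a field as text; @> tests containment
    cur.execute(
        """
        SELECT id, custom ->> 'serial_number'
        FROM items
        WHERE custom @> %s::jsonb
        """,
        ('{"location": "rack 12"}',),
    )
    print(cur.fetchall())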

Jo
Jan 24, 2005

:allears:
Soiled Meat
What's the name of the algorithm that selects a fixed region of a given minimum and maximum area, then places a point in it, repeating the process until the region is almost filled? It gives a nice not-quite-uniform but not-quite-random packing.

Xerophyte
Mar 17, 2008

This space intentionally left blank

Jo posted:

What's the name of the algorithm that selects a fixed region of a given minimum and maximum area, then places a point in it, repeating the process until the region is almost filled? It gives a nice not-quite-uniform but not-quite-random packing.

There are a few things that do what I think you're after. You can subdivide your region into an array of buckets of fixed equal size and then uniformly sample a point in each bucket, which is called jittered or stratified sampling. There are various pseudo-random sequences (Hammersley, Halton, etc.) that are designed such that taking the first N elements of the sequence results in an even-ish distribution of points. There's maximal Poisson disk sampling, where you try to find a new point that's sufficiently far away from your old points; this is often referred to as dart throwing, after the first algorithm to implement it, which uniformly selected candidate samples and accepted or rejected them depending on distance to the closest existing accepted sample.

There are a lot of variations. As a category, the entire "not uniform but not random" sampling thing is generally called blue noise sampling or low-discrepancy sampling.
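
A minimal sketch of the jittered/stratified idea: one uniform sample per grid cell, which gives exactly that even-but-not-regular look (names invented):

code:

import random

def jittered_points(width, height, cell):
    # one uniformly-placed point per cell of a regular grid
    pts = []
    for gy in range(height // cell):
        for gx in range(width // cell):
            pts.append((gx * cell + random.uniform(0, cell),
                        gy * cell + random.uniform(0, cell)))
    return pts

print(len(jittered_points(100, 100, 10)))  # 100 points, one per 10x10 cell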

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.

Jo posted:

What's the name of the algorithm that selects a fixed region of a given minimum and maximum area, then places a point in it, repeating the process until the region is almost filled? It gives a nice not-quite-uniform but not-quite-random packing.

https://en.wikipedia.org/wiki/Voronoi_diagram ?

Jo
Jan 24, 2005

:allears:
Soiled Meat

Xerophyte posted:

jittered or stratified sampling

This has me on the right track, I think. Much obliged.

Cuntpunch
Oct 3, 2003

A monkey in a long line of kings
Is my lack of a formal education leaving me high and dry, or am I correct in feeling like this is...not a reasonable way to construct (and resolve) a class hierarchy?

code:

public enum ParentType
{
    Mom,
    Dad,
}

public interface IGrandad
{
    ParentType PType { get; set; }
}

public interface IMom
{
    void MomThings();
}

public interface IDad
{
    void DadThings();
}

public class Dad : IDad, IGrandad { /* snip */ }
public class Mom : IMom, IGrandad { /* snip */ }

public class ParentFinder
{
    public void DoParentThings(IEnumerable<IGrandad> list)
    {
        foreach (var thing in list)
        {
            // inspect the type-tag enum, then downcast accordingly
            switch (thing.PType)
            {
                case ParentType.Mom:
                    (thing as Mom).MomThings();
                    break;

                default:
                    (thing as Dad).DadThings();
                    break;
            }
        }
    }
}
Because using an interface simply to provide a key enum about child types just feels...really really wrong to me. But this comes from some rather senior people that I don't know well enough yet to doubt.

spiritual bypass
Feb 19, 2008

Grimey Drawer
That poo poo doesn't make any sense at all. Couldn't Mom and Dad implement IParent? Why would you switch on the type of Mom and Dad? IParent should have a changeDiaper() method or something that changes a diaper, but the details should be hidden from whoever is using it. You shouldn't need to know the specific type of an object, only the contracts it claims to fulfill.
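
A sketch of what that looks like in practice, in Python for brevity (the C# version is the same idea with an IParent interface and virtual dispatch); all names here are invented:

code:

class Mom:
    def do_parent_things(self):
        print("mom things")

class Dad:
    def do_parent_things(self):
        print("dad things")

def do_parent_things(parents):
    for p in parents:
        p.do_parent_things()  # dynamic dispatch replaces the enum switch

do_parent_things([Mom(), Dad()])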

dupersaurus
Aug 1, 2012

Futurism was an art movement where dudes were all 'CARS ARE COOL AND THE PAST IS FOR CHUMPS. LET'S DRAW SOME CARS.'

Cuntpunch posted:

But this comes from some rather senior people that I don't know well enough yet to doubt.

Seniority does not imply competence

Asymmetrikon
Oct 30, 2009

I believe you're a big dork!
If it didn't have the IMom and IDad and was just switching on the specific type of the IGrandad, it would be fine as a kind of algebraic data type, but I have no idea what those interfaces are doing. Though if all it's really doing is calling a method on the object no matter what, there's no reason for that kind of thing.

Cuntpunch
Oct 3, 2003

A monkey in a long line of kings
What feels bonkers to me is the use of an upper-level interface to do little more than ensure there's an enum to flag which child type a thing is. And that's a simplified example; in practice we've got this sort of 'intermediary interface simply defines an enum for a lower-level type for casting purposes' up to 3 layers deep in some places.

dougdrums
Feb 25, 2005
CLIENT REQUESTED ELECTRONIC FUNDING RECEIPT (FUNDS NOW)
Yeah, nothing is going to inherit from Mom or Dad, so you might as well seal them and use Type.IsAssignableFrom() to do your type matching, instead of having a big circular mess.

It's not the worst thing ever, but it's a bit overkill based on what was posted.

I mean generally you'd make an IDad and then have class Person extend IDad to make a Dad, because it's also likely that Mom and Dad share the same members, but it's just that one of them needs to do some IDad stuff now.

I guess another approach would be to hit the enums with [Flags] and get rid of the unnecessary hierarchy.

dougdrums fucked around with this message at 15:04 on Oct 18, 2016

Peristalsis
Apr 5, 2004
Move along.
I'm writing a small Ruby on Rails* app to automate the sending of text messages on a user-configurable schedule.

I have a scheduled_texts table, and I'll probably just have a flag on each text to indicate if it has already been sent, still needs to be sent, or generated some error when being sent. I don't anticipate any problems with this approach. However, if this app were being used on a larger scale, having the table of texts to be sent cluttered up with every text that has ever been sent could get ugly. At my first programming job, many years ago, we had archive tables for this sort of thing, but it just seems wrong to have multiple tables with the same schema, even for archival purposes. It would also be a headache to maintain both tables whenever there's a change in the schema, and a potential problem when a change to the current table could invalidate data in the archive table.

Is there a better, more modern best practice for this? I've Googled a bit, but have only found descriptions of ETL solutions (which are well beyond the complexity I need for this, and which don't really solve the issue on a fundamental level) or multi-server indexing or something that I don't even understand.


* I didn't post in the RoR thread because this is a more general database design question - the development platform is pretty irrelevant here.

spiritual bypass
Feb 19, 2008

Grimey Drawer
Seems like you could have a partition on the sent_timestamp or whatever. Depending on your database system and how you set it up, you now have something close to two physical tables that are logically the same when you query against it.
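
As a sketch of that idea: newer Postgres releases (10 and up) have declarative range partitioning, so sent history can live in its own physical partition behind one logical table. All names here are invented, and psycopg2 is shown as one possible driver:

code:

import psycopg2

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE scheduled_texts (
            id      bigint,
            body    text,
            status  text,
            sent_at timestamptz
        ) PARTITION BY RANGE (sent_at);

        -- queries against scheduled_texts hit the right partition transparently
        CREATE TABLE scheduled_texts_2016 PARTITION OF scheduled_texts
            FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');
    """)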

me your dad
Jul 25, 2006

General question to see if something is feasible:

I build HTML emails. I get copy delivered in Word documents, which I transfer into HTML. We are a fairly large organization and content comes to me from many people. Some people copy content for their emails from other emails we've previously sent out. As a result, links in the Word documents are sometimes copied as the tracking URLs used by our email service provider. This is not good for my workflow.

Here's an example of how a tracking URL looks:

http://subdomain.domain.com/ct/90809005:Ts44xFYVN:m:1:1612444031:6B9F5AA951851A1C84240C42C18EFEB7:r

The only consistent elements are http://subdomain.domain.com/ct/ and the final letter at the very end. The alphanumeric strings are unique for each link.

When clicked, that tracking URL redirects to the actual destination URL. I need those final destination URLs when I'm building the emails out in HTML, and having to manually open each one is tedious and leaves room for errors.

So...

I'm trying to find out if there might be a way to automatically replace the tracking URLs with the actual target URLs. Any general direction of a resource would be appreciated. I'm happy to dig into a solution if I can get started down the right path.

Any ideas?

nielsm
Jun 1, 2009

If you need to open everything in Word anyway, to get the text out, the best solution workflow-wise may be a VBA macro to sanitize the links. However I'm not sure how reasonable it is to do web requests from VBA.

The basic process would be to find all links, check for the pattern you describe, and then for each of them simply emulate a browser requesting the URL and see what comes back. Most tracking services would just return a basic redirect to the actual target, so you'd be done in a single request per unique URL.

It's definitely something that can be automated, the question is really just at what stage you want to do it.

Sab669
Sep 24, 2009

I have a DataTable with the following columns:

TableFieldID
EntityType
TableName
FieldName
Caption

TableName contains 3-letter abbreviations such as "ADT", "ADN", "ASS", "PAT" and a few others.

Caption contains "user friendly" values for various columns associated with the corresponding Table. Example being "Audit Create Date".

FieldName contains a string using the following "pattern": TableFieldID@EntityType#TableName.Caption except we remove any spaces from the Caption. So one of the values might be "49555@13#ADT.AuditCreateDate"

I need to remove all Audit-related items depending on the type of user that is logged in.

I was able to simply do something like this:

code:

myDataTable.Select("TableName LIKE 'ADT' OR TableName LIKE 'ADN' OR Caption LIKE '%Audit%'");

Then I just remove the returned collection from myDataTable.

This worked, until I realized there are columns which contain the string "audit", but aren't actually part of the set that should be removed, such as "Auditory Comprehension"

Any ideas on how I can deal with this? I can't add a space to my query -- '%Audit %' -- because then this will miss columns such as "Audited On".

There are over 1400 records in this result set, I don't really want to manually comb through this looking for exceptions to the rule but I think I might not have any options...

csammis
Aug 26, 2003

Mental Institution
You're probably screwed. There are types of fulltext indexes which can filter on stems - pass it "Audit" and you'd get "Audited" and "Auditing" but not "Auditory" - but unless your database supports such an index and you can apply it then you're stuck looking through the records by hand.

What is EntityType? Can you use that in your predicate somehow?
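
If a fulltext index isn't available, a crude stem heuristic over the captions can get most of the way there. This is only a sketch (Python regex for brevity; .NET's Regex class works the same way), and it's exactly the kind of rule that still needs eyeballing for exceptions:

code:

import re

# "Audit" plus common suffixes, bounded so "Auditory" doesn't match
AUDIT_STEM = re.compile(r"\bAudit(s|ed|ing)?\b")

captions = ["Audit Create Date", "Audited On", "Auditory Comprehension"]
print([c for c in captions if AUDIT_STEM.search(c)])
# ['Audit Create Date', 'Audited On']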

Sab669
Sep 24, 2009

Honestly I'm not sure what EntityType is. Fired off an email to the DBA :downs:

But yea, me thinks you're right. Most likely SOL.

me your dad
Jul 25, 2006

nielsm posted:

If you need to open everything in Word anyway, to get the text out, the best solution workflow-wise may be a VBA macro to sanitize the links. However I'm not sure how reasonable it is to do web requests from VBA.

The basic process would be to find all links, check for the pattern you describe, and then for each of them simply emulate a browser requesting the URL and see what comes back. Most tracking services would just return a basic redirect to the actual target, so you'd be done in a single request per unique URL.

It's definitely something that can be automated, the question is really just at what stage you want to do it.

Thanks. I'll look into the VBA solution!

mystes
May 31, 2006

nielsm posted:

If you need to open everything in Word anyway, to get the text out, the best solution workflow-wise may be a VBA macro to sanitize the links. However I'm not sure how reasonable it is to do web requests from VBA.

The basic process would be to find all links, check for the pattern you describe, and then for each of them simply emulate a browser requesting the URL and see what comes back. Most tracking services would just return a basic redirect to the actual target, so you'd be done in a single request per unique URL.

It's definitely something that can be automated, the question is really just at what stage you want to do it.
I really would not recommend using VBA when there's any alternative, especially when the file is being converted to HTML anyway. This is something that should be fairly simple to do in any language.
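
A sketch of that non-VBA route: request each tracking URL without following the redirect and take the Location header. This assumes the third-party requests package, an already-exported HTML file with an invented name, and the URL pattern from the post above:

code:

import re
import requests

TRACKING = re.compile(r"https?://subdomain\.domain\.com/ct/[^\s\"'<>]+")

def resolve(url):
    # allow_redirects=False so we get the 301/302 itself, not the final page
    resp = requests.get(url, allow_redirects=False, timeout=10)
    return resp.headers.get("Location", url)  # fall back to the original

with open("email.html") as f:
    html = f.read()

clean = TRACKING.sub(lambda m: resolve(m.group(0)), html)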

Steve French
Sep 8, 2003

Peristalsis posted:

I'm writing a small Ruby on Rails* app to automate the sending of text messages on a user-configurable schedule.

I have a scheduled_texts table, and I'll probably just have a flag on each text to indicate if it has already been sent, still needs to be sent, or generated some error when being sent. I don't anticipate any problems with this approach. However, if this app were being used on a larger scale, having the table of texts to be sent cluttered up with every text that has ever been sent could get ugly. At my first programming job, many years ago, we had archive tables for this sort of thing, but it just seems wrong to have multiple tables with the same schema, even for archival purposes. It would also be a headache to maintain both tables whenever there's a change in the schema, and a potential problem when a change to the current table could invalidate data in the archive table.

Is there a better, more modern best practice for this? I've Googled a bit, but have only found descriptions of ETL solutions (which are well beyond the complexity I need for this, and which don't really solve the issue on a fundamental level) or multi-server indexing or something that I don't even understand.


* I didn't post in the RoR thread because this is a more general database design question - the development platform is pretty irrelevant here.

Do you need to keep them around after they've been sent? If no, can you just delete them afterwards?

BedBuglet
Jan 13, 2016

Snippet of poetry or some shit
Can someone explain to me why Python uses a mutex (the GIL) when you're green threading? Like, the best I can guess is that maybe CPython has a risk of creating race conditions due to CPython's threads running faster than the Python code in the process?

Also, is Python itself already green threaded? It looks like, when you make a user thread in Python, you're actually threading into Python's underlying threading.

I don't know, the poo poo confuses me. It's some next level inception bs. All I know is that Python seems to get increasingly slower the more I use threading.

Jabor
Jul 16, 2010

#1 Loser at SpaceChem
Because correct and performant multithreading is hard to do right and easy to gently caress up, and just having a single lock for the interpreter is the best compromise for basically all the things you're supposed to use python threads for anyway.

Python isn't suitable for doing heavy compute tasks directly. You're supposed to call out to non-Python library code if you want to do that sort of heavy lifting.
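
A quick way to see the consequence: CPU-bound work gets no faster with threads, because the GIL lets only one thread execute Python bytecode at a time. A minimal timing sketch:

code:

import threading
import time

def burn():
    # pure-Python CPU work; the GIL serializes this across threads
    n = 0
    for _ in range(10_000_000):
        n += 1

start = time.perf_counter()
burn()
burn()
print("sequential:", time.perf_counter() - start)

start = time.perf_counter()
threads = [threading.Thread(target=burn) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("two threads:", time.perf_counter() - start)  # about the same, or worse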

BedBuglet
Jan 13, 2016

Snippet of poetry or some shit

Jabor posted:

Because correct and performant multithreading is hard to do right and easy to gently caress up, and just having a single lock for the interpreter is the best compromise for basically all the things you're supposed to use python threads for anyway.

Python isn't suitable for doing heavy compute tasks directly. You're supposed to call out to non-Python library code if you want to do that sort of heavy lifting.

Thanks. I was aware that Python was typically used more as glue between non-Python code but didn't really appreciate why. That's fair reasoning.

Peristalsis
Apr 5, 2004
Move along.

Steve French posted:

Do you need to keep them around after they've been sent? If no, can you just delete them afterwards?

I'm not sure if I'm required to keep them, but I want to do so at least to cover my own rear end (i.e. I don't want someone coming to me whining that some text didn't get sent, and not be able to look it up and see what's going on).

Steve French
Sep 8, 2003

Peristalsis posted:

I'm not sure if I'm required to keep them, but I want to do so at least to cover my own rear end (i.e. I don't want someone coming to me whining that some text didn't get sent, and not be able to look it up and see what's going on).

That would be a good use for logging

Peristalsis
Apr 5, 2004
Move along.

Steve French posted:

That would be a good use for logging

I suppose that's true, in this case. I'm curious about my original question, though - is there a best practice these days for tables that legitimately have too many rows? I'm not really sure what partitioning on a column means (though I can guess, or even look it up), but it sounds like there isn't just a single pattern that everyone uses and that takes care of the problem.

Plorkyeran
Mar 22, 2007

To Escape The Shackles Of The Old Forums, We Must Reject The Tribal Negativity He Endorsed
There's a fairly narrow window where "too many rows" is actually a thing but a relational database is still the appropriate storage. Most database servers these days support sharding a table across multiple servers in a cluster.
