|
Isn't this essentially a distributed union-find?
|
# ? Feb 4, 2022 06:27 |
|
Jabor posted:Isn't this essentially a distributed union-find?

It might be! I'll look into that, thanks. The briefest of Google searches seems to turn up a lot of references to grouping pixels/voxels together, so I'm hoping the established algorithms can handle nodes with lots of edges. A lot of these papers are also from the last couple of years, so it seems like this is a pretty active area of research?

Edit: From this paper https://arxiv.org/pdf/2003.02351.pdf

quote:2.2 Distributed and Parallel Visualization and Analysis

Because our software is search-based and it's much simpler for me to get a list of all matches to a given document than it is to get a set of individual union operations, it seems like distributed breadth-first CCL is maybe something else to look for.

Double edit: I haven't read enough yet to know if this is the case, but one thing that immediately jumps out at me is that because the algorithms I can find focus on traversing images, they are likely optimized for deep but narrow trees. In practice, our application tends to be wide but shallow - it would be unusual for a cluster to take more than 2-3 iterations to find all connected values, but it would not be unusual for a cluster to have thousands of elements. Just thought I'd call that out in case it's relevant. The methods I'm seeing also tend to separate local and global structures: each thread focuses on a chunk of data and generates smaller trees, which are then connected together in a second pass. That makes sense if you have a spatial dimension you can divide on, but I don't.

Triple edit, oh my god: Okay, I think I'm starting to understand union-find a little better, and it seems like it could be applicable to situations where a single union involves more than a pair. My biggest concern is that I have no guarantee that two threads won't decide to make two identical but backwards unions at the same time (in fact, it's very likely), because I can't partition the documents into discrete sets - any document could theoretically match any other document, and the matching algorithm returns all matches in the database.

KillHour fucked around with this message at 07:06 on Feb 4, 2022 |
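For reference, here's what plain sequential union-find with path compression looks like - a sketch of the textbook structure the distributed variants build on, not the distributed algorithm itself:

```python
class UnionFind:
    """Textbook union-find: path compression + union by size."""

    def __init__(self):
        self.parent = {}  # element -> parent; roots point at themselves
        self.size = {}

    def find(self, x):
        # Lazily register unseen elements as singleton sets.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: repoint everything on the walk at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Union by size: hang the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
```

One reassuring property: union(a, b) and union(b, a) produce the same partition, and repeating a union is a no-op, so "identical but backwards" unions are benign at the data-structure level - the hard part in a distributed setting is the concurrency, not the duplication.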
# ? Feb 4, 2022 06:40 |
|
KillHour posted:My company's software has a matching process that is really neat but I think the actual algorithm for iterating over everything isn't very efficient and there's probably a really clever way to do this better. What is the actual problem statement here? You have a distributed database of (small) documents, and you want to take in a query and return a “superset” document that merges information from all the documents which partially match the query, with references to the original documents? Because it sounds like a trivially-distributed query with an inexpensive result-merging operation, and all of the extra complexity and expense is due to the effort to make it iterative.
|
# ? Feb 4, 2022 07:05 |
|
rjmccall posted:What is the actual problem statement here? You have a distributed database of (small) documents, and you want to take in a query and return a “superset” document that merges information from all the documents which partially match the query, with references to the original documents? Because it sounds like a trivially-distributed query with an inexpensive result-merging operation, and all of the extra complexity and expense is due to the effort to make it iterative.

The problem statement is: I have a document set with lots of partial duplicates (names spelled wrong, different phone numbers, etc.), and I want to apply some kind of rule that says whether two documents are the "same" or not, across all combinations of documents, to end up with discrete sets of "mastered" entities. Think MDM. In certain circumstances, this might require multiple passes to ensure I get all matches. An example of this is if A would match B, A would NOT match C, and B would NOT match C, but the combination of A + B WOULD match C:

A:
Name: John Smyth
Address: 123 Fake Street
Phone: 555-555-5555
Email: jsmyth@company.com

B:
Name: John Smith
Address: 123 Fake Street
Phone: 555-555-5555
Email: johnsmyth@company.com

C:
Name: John Smith
Address: null
Phone: null
Email: jsmyth@company.com

So in this example, you need to merge A and B before you can realize that C should be merged as well, since C matches one field from each of A and B but two fields from the combination of them (assuming your heuristic is two or more matching fields).

The thing our database is really good at is returning search results. So it's very fast for me to say "give me the list of all documents that have propertyA = foo AND propertyB = bar AND propertyC != baz" or even "give me the list of all documents where propertyA is within a Levenshtein distance of 1 of 'foo'" (so 'boo' and 'foq' but not 'fuu'), or lots of other heuristics. It's also fast to combine these operations together.
Our database is not very good at "iterate over every document and do x to it" so anything that we can use the search functionality to do is preferred.
|
# ? Feb 4, 2022 07:21 |
|
KillHour posted:The problem statement is I have a document with lots of partial duplicates (names spelled wrong, different phone numbers, etc) and I want to process some kind of rule that says whether two documents are the "same" or not across all combinations of documents to end up with discrete sets of "mastered" entities. Think MDM. In certain circumstances, this might require multiple passes to ensure I get all matches. An example of this is if A would match B, A would NOT match C, and B would NOT match C, but the combination of A + B WOULD match C: If you have a fixed set of fields and your records maintain an absolute order, you could create an index of number of matched fields per record. If your number of fields is small, you could use a bit set for matching a field. If you use a bit set then merging results is a bitwise or in your index.
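A sketch of the bit-set idea, assuming a small fixed field list (field names are borrowed from the example upthread; everything here is illustrative):

```python
FIELDS = ["name", "address", "phone", "email"]  # fixed order -> bit positions

def match_bits(doc_a, doc_b):
    """Bit i is set iff FIELDS[i] is present in both records and equal."""
    bits = 0
    for i, field in enumerate(FIELDS):
        a, b = doc_a.get(field), doc_b.get(field)
        if a is not None and a == b:
            bits |= 1 << i
    return bits

def merged_match_bits(bits_a, bits_b):
    # Merging two partial results really is just a bitwise OR.
    return bits_a | bits_b

def popcount(bits):
    """Number of matched fields."""
    return bin(bits).count("1")
```

With the A/B/C records from the earlier post, C matches A on one field and B on one field, but OR-ing those two bit sets yields two distinct matched fields - exactly the transitive case that needed a second pass.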
|
# ? Feb 4, 2022 14:04 |
|
Unfortunately, "any two fields match" is just an easy example. Different fields could have different weights (even negative ones), and thresholds could be arbitrary. In addition, you can have multiple actions with different thresholds. So to make the example slightly more complicated: you could say name is worth 10 points; phone number, email, and address are each worth 5 points; and you need a total of 15 points to automatically merge, or 10 points to go into a queue for manual review. The number of fields, and the fields themselves, are also arbitrary. We're not trying to solve this for a single particular situation and a single set of documents - this is a framework designed for any possible set of matching criteria on arbitrary documents.

Edit: Wouldn't you need that bit set for each pair of records, not each individual record? There could easily be tens of millions of records.

KillHour fucked around with this message at 15:41 on Feb 4, 2022 |
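To pin down the weighted version, here is the example rule as code. The weights and thresholds are the hypothetical ones from this post, and comparing against a single merged dict is a simplification (a real merged record would keep every variant value per field, not one):

```python
# Hypothetical rule set from the example above; real rules are arbitrary.
WEIGHTS = {"name": 10, "phone": 5, "email": 5, "address": 5}
MERGE_THRESHOLD = 15   # auto-merge
REVIEW_THRESHOLD = 10  # manual-review queue

def score(doc_a, doc_b):
    """Sum the weights of fields that are present in both records and equal."""
    return sum(
        weight
        for field, weight in WEIGHTS.items()
        if doc_a.get(field) is not None and doc_a.get(field) == doc_b.get(field)
    )

def action(doc_a, doc_b):
    s = score(doc_a, doc_b)
    if s >= MERGE_THRESHOLD:
        return "merge"
    if s >= REVIEW_THRESHOLD:
        return "review"
    return "ignore"
```

Under these numbers, the A/B pair from the earlier example scores 10 (phone + address) and lands in the review queue rather than auto-merging - which shows how the weighted rule and the "two fields match" heuristic can disagree.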
# ? Feb 4, 2022 15:39 |
|
KillHour posted:
So at work recently I've been implementing this thing called wait-free union find. I'm not sure if it specifically works for your case but it might be worth looking into
|
# ? Feb 5, 2022 02:10 |
|
HappyHippo posted:So at work recently I've been implementing this thing called wait-free union find. I'm not sure if it specifically works for your case but it might be worth looking into

Thanks for the suggestion. Based on this paper - https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.8354&rep=rep1&type=pdf - it seems to assume there's a shared piece of memory that all threads can access to update their results. Because this is a transactional database, everything is inherently stateless and a thread only has access to itself. The only ways to access outside data are to search (fast), read (kind of slow), or write (very slow) documents in the database. In addition, if I want to write a document without terminating the current thread, I need to create a new transaction in its own context, passing the data to be written to that transaction as a calling parameter, and then terminate that thread. In the current implementation, the thread/batch only ever writes out its state once, at the very end.

What I really need is a completely stateless algorithm where one thread never needs to know what another is doing, other than being able to check the results of already completed and terminated threads. In addition, threads can spawn in any order and might have any arbitrary subset of starting documents. Oh, and because of the way we handle scaling, it's possible for a server running a bunch of threads to just disappear mid-transaction, so everything has to be guaranteed ACID so we can just resubmit that batch and be confident it's idempotent and not leaving garbage in the database. That is given to be true if it's done in a single transaction (since writes are committed after the transaction terminates), but that precludes spawning off transactions to write before the entire thing is finished. ....yeah.

Edit: Sorry if it sounds like I'm being obtuse. I promise the system can actually work really well if you can design your algorithm in discrete steps such that a given step can run in parallel in a single transaction without writing data until the end. What we can do is take the output of a step in aggregate and use that as the input to the next step, but you want to limit that as much as possible, since writing to disk is slow.

Double edit: To make it more explicit how this is all orchestrated: we have a process that runs an arbitrary search on the database for some set of documents. Those documents are then split into batches (usually of between 1 and a few hundred documents), and for each batch, a V8 runtime is spun up in its own isolated context. I can order these documents by requesting the search results be ordered by their order in an index (so I could have all documents sorted alphabetically by last name, for instance), or I can leave them in "default" order, which is deterministic but pseudorandom. What I can't do is split the batches differently - they are always the same size (except the last batch), so it would be very difficult or impossible to ensure two given documents always end up in the same batch.

This V8 context has access to the documents passed into it, as well as a set of APIs for searching, retrieving, and writing documents to the database. Searching and retrieving happen immediately (assuming the document isn't locked by another thread), but writing only happens after the context is terminated and the transaction is complete. You can also spawn other transactions, but that's an antipattern for the reasons above (although sometimes it's unavoidable - when reading from two different databases, for instance). If you need to, you can let all the threads/batches complete and then run a new search on the database (likely for the output of the first step) and use those results as input to a next step that repeats the process.

I think at this point I'm rubber ducking, but it's nice to write it all down anyways.

KillHour fucked around with this message at 04:41 on Feb 5, 2022 |
# ? Feb 5, 2022 03:56 |
|
Counter-intuitively, I think it’s at least possible that you would scale better if you were less worried about doing unnecessary work. Like, you could fetch all the documents that match your starting document for any field, without trying to make the database query enforce the rule about needing a match in multiple fields, and then iteratively (1) locally figure out which of the fetched documents not already in your equivalence class now belong in it, (2) figure what field values you’ve just added for the first time to your superset document, (3) do a query asking for matches for any of those, and (4) go back to step 1, adding any new documents you found. You’ll do some redundant I/O, but your queries will be smaller and simpler to execute, and you won’t have to build a bunch of temporary tables. And of course you can distribute this, so that you get all the transitive matches available locally on any node. It won’t stop you from having to do iterations, but you’ll probably converge a lot faster. And you can report partial matches back in your intermediate results, which would probably be useful in avoiding unnecessary iterations.
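That loop can be sketched as follows. This is a toy: `db` and `search` stand in for the real document store and indexed query, and the "two or more shared field values" check stands in for whatever the real local rule is:

```python
def build_cluster(db, seed, min_matches=2):
    """Grow an equivalence class via the 4-step loop: query wide on any
    single field, then apply the matching rule locally against the growing
    superset of field values."""

    def search(field, value):
        # Stand-in for the real indexed single-field query.
        return [d for d in db if d.get(field) == value]

    def fields(doc):
        return {(f, v) for f, v in doc.items() if f != "id" and v is not None}

    cluster = {seed["id"]}
    superset = fields(seed)       # union of all field values in the class
    frontier = set(superset)      # values we haven't queried on yet
    while frontier:
        field, value = frontier.pop()
        for doc in search(field, value):
            if doc["id"] in cluster:
                continue
            # Step 1: apply the matching rule locally, against the superset.
            if len(fields(doc) & superset) < min_matches:
                continue
            cluster.add(doc["id"])
            # Steps 2-4: only values seen for the first time seed new queries.
            new_values = fields(doc) - superset
            superset |= new_values
            frontier |= new_values
    return cluster
```

The key property is that a document only triggers new queries for field values the class hasn't seen yet, so wide-but-shallow clusters converge in a couple of rounds. One caveat of this naive version: a candidate rejected by the local rule is only reconsidered if a later query rediscovers it.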
|
# ? Feb 5, 2022 05:52 |
|
rjmccall posted:and you won’t have to build a bunch of temporary tables.

Not sure if this factors into what you're suggesting (since it's late and my brain isn't processing well), but we're a NoSQL database, so there aren't any temporary tables (or tables at all). Data returned from a query is in JSON format and maps cleanly to a JS object when accessed (but is typically resolved lazily so you don't accidentally try to dump a huge query out from disk all at once). Complex nested queries are actually generally faster to do directly in the database than in the process itself: they work directly on the in-memory indexes on dedicated hardware and are implemented in C++ instead of running inside V8. As a general rule, any time I can use a native query to do some work for me instead of doing it in the context of the step, it's likely going to be faster.

As an example, I ran a query that said "give me the contents of all documents with this name" on a pretty small dev instance. It took about 14ms to return the contents of 508 documents. Then I said "give me the IDs of the documents that match at least two of the three following criteria {name, email, phone}" and it took 2.5ms to return the 162 IDs of the specific set of documents I wanted, purely because everything happened in memory and I didn't need to wait for the disks to retrieve the actual JSON data (I disabled result caching for the test). The documents themselves aren't huge, but they aren't small either - they have a lot of fields I'm uninterested in for the purpose of mastering. I could get clever and have the DB only return the portions of the documents I care about (since it's a JSON tree, I can use XPath for that), but it's probably still faster to go directly off of the indexes whenever possible.

One thing that might be worth looking into is that the hardware that performs queries is physically separate from the hardware that performs the batch jobs, and the former doesn't dynamically scale (because it's tied to the actual drives) while the latter can dynamically scale quite large. So if I can keep the I/O down, it might be worth trying to spread the work out a bit more. Definitely wouldn't want to start pulling entire documents for comparisons, though, since that would crush my I/O.

KillHour fucked around with this message at 06:46 on Feb 5, 2022 |
# ? Feb 5, 2022 06:33 |
|
Alright, fair enough. You’d said something about checking in the database that you hadn’t already merged a particular document, which sounds like a temporary object of some sort; that’s all I was thinking.
|
# ? Feb 5, 2022 07:02 |
|
rjmccall posted:Alright, fair enough. You’d said something about checking in the database that you hadn’t already merged a particular document, which sounds like a temporary object of some sort; that’s all I was thinking.

Oh yeah - since the threads are all starting and completing at different times (there could be a hundred thousand batches but you only run 100 threads at a time, for instance), the ones that are complete have already written their results out. Since data is indexed as soon as it's written, it's fast to do a quick check of "is there any document in the 'merge' collection that has this document ID in it already?" Since the index is sorted, it's just a binary search to say whether it exists or not. If it exists, we can skip it. It is a temporary document, but we need that document to actually complete the merge pass, so it's overall pretty efficient and saves a bunch of work.

Edit: Thanks for all the suggestions, seriously. It's helping me think more fully about what the system is actually doing instead of just the parts I'm inclined to focus on.

KillHour fucked around with this message at 07:14 on Feb 5, 2022 |
# ? Feb 5, 2022 07:10 |
|
I'm struggling with something that I think might just be a bad approach to begin with. Basically, I'm on a .NET + JS project, mainly doing frontend stuff for a webshop, and we send requests to / get responses from a backend that's maintained separately by a different team. When there are user-facing errors, we show a generic message while the real fault info (e.g. "user is flagged as unwanted customer" or whatever) is hidden. Now there's a desire from testers to see the real fault info in that error message when in test environments.

So! The original fault info is now included as a header in the WCF response we get from said backend, and I can't for the life of me figure out how to propagate that to the frontend. I can intercept the message in an IClientMessageInspector and verify the header is present, but once it moves beyond that point, all that remains is what's in the message body. My gut feeling is "if it's meant to be available on the frontend, it goes in the body", but that's considered a security issue, since then anyone could see that the fault codes are being masked, even if the original fault data isn't present in production. I'm probably phrasing this all as a confusing mess, possibly because it feels entirely like a confusing mess to me. Is there a good way to do this? Is the approach of sending this data in a custom header reasonable in the first place? I might just be dumb as hell.

Edit for tl;dr: I'm trying to get a custom header and its value from a WCF response and display it on a webpage, and it seems borderline impossible.

Woebin fucked around with this message at 13:35 on Feb 7, 2022 |
# ? Feb 7, 2022 13:29 |
|
Log the actual error message (making sure to sanitize any sensitive info) and give testers the ability to access those logs. Don’t send that stuff back in any part of the response if you don’t want all users to be able to see it.
|
# ? Feb 7, 2022 13:34 |
|
Blinkz0rz posted:Log the actual error message (making sure to sanitize any sensitive info) and give testers the ability to access those logs. Don’t send that stuff back in any part of the response if you don’t want all users to be able to see it.
|
# ? Feb 7, 2022 13:36 |
|
By "frontend" do you mean client-side JS? Or do you mean the frontend web server that's turning your backend responses into something suitable to send to user's browsers? If there's something you want to keep secret, it shouldn't be sent to client-side JS at all. (Not even in a header!) You should be checking on your web server whether it's appropriate to show full debug information, and only if it is should you send that data down to the client. What's in the message body you're getting from your backends? Surely it's some sort of structured message and not just raw HTML that goes down to clients unchanged?
|
# ? Feb 7, 2022 13:39 |
|
Jabor posted:By "frontend" do you mean client-side JS? Or do you mean the frontend web server that's turning your backend responses into something suitable to send to user's browsers?
|
# ? Feb 7, 2022 13:48 |
|
So the "right" answer here is that the external backend shouldn't have that much control over what your frontend sends down to clients. Your frontend should have responsibility for its own JSON, and the backend should be able to add information as it pleases to your internal messaging without that affecting what gets sent down to clients. Actually getting to that point from your current situation sounds like it might be a challenge though. Figuring out where exactly your frontend turns WCF Messages into JSON is probably the best next step - it's at that point you could look at the Properties on the message and either include or not include the extra debugging data in the JSON you emit (depending on whether it's a test server or not).
|
# ? Feb 7, 2022 14:12 |
|
Jabor posted:So the "right" answer here is that the external backend shouldn't have that much control over what your frontend sends down to clients. Your frontend should have responsibility for its own JSON, and the backend should be able to add information as it pleases to your internal messaging without that affecting what gets sent down to clients. But yeah, I think your point holds and is also where I'm at - I need to figure out where we break down the original response message. Or there's the other option - convince the testers and business people that this is a bad idea and they should just learn to check the logs. Days like this really make me feel like I'm terrible at my job.
|
# ? Feb 7, 2022 14:28 |
|
Enjoy the poo poo storm that happens when someone forgets to check a box and the test version goes into prod.
|
# ? Feb 7, 2022 14:44 |
|
KillHour posted:Enjoy the poo poo storm that happens when someone forgets to check a box and the test version goes into prod. Java code:
|
# ? Feb 7, 2022 14:58 |
|
I give people too much credit
|
# ? Feb 7, 2022 15:16 |
|
i'm being tasked with learning Flutter for work. All expenses paid. What are the best resources for learning Flutter, not including a whole-rear end bootcamp?
|
# ? Feb 7, 2022 20:49 |
|
teen phone cutie posted:i'm being tasked with learning Flutter for work. All expenses paid.

If you don't know Dart yet, start there
|
# ? Feb 8, 2022 02:43 |
|
This is more of a networking question, maybe there isn't going to be much actual programming involved. I live in the US but I sometimes have to use NordVPN to switch to a Spanish IP so I can, for example, watch Spanish movies in the original language on Netflix. I have multiple devices including a TV connected to a router, and I want to be able to change location on a dime so I can watch Spanish Netflix on my TV. I have a raspberry pi which I have used before for Pi-Hole, and I suspect I could do something similar for this where I connect the Pi by NordVPN and route all traffic through it. Does that make sense? It's the next best thing to just loving moving to Spain
|
# ? Feb 13, 2022 08:18 |
|
Some routers can do a VPN for the whole network if that's what you want. Taking the Raspberry Pi approach would mean essentially making your own router. Some people do this, but it's complex and not worth it for most people.
|
# ? Feb 13, 2022 14:21 |
|
Coincidentally, I am just bringing up a new OpenWrt router that supports Pi-hole-style DNS adblocking and can run a VPN client. Quite happy with the ease of use and the flexibility so far. I don't think it'd be too difficult to script up a toggle switch to change your VPN location, for instance. There's definitely a Pi image available.
|
# ? Feb 13, 2022 20:51 |
|
Dumb question incoming. I want to check Google Maps travel times for preset routes several times a day and log the times. I don't want to trust the 'estimated times' Google gives when you select your departure time, because there's no report of standard deviation. I've tried googling this but just get stuck in Google Maps how-to pages. Has anyone heard of this being done already, or do I need to set something up myself, if it's possible?
|
# ? Feb 21, 2022 03:14 |
|
Spikes32 posted:Dumb question incoming. I want to check Google map times for preset routes several times a day and log the times. I don't want to trust 'estimated times' that Google gives when you select your departure time because there's no report of standard deviation. I've tried googling this, but just get stuck in Google map how to pages. Has anyone heard of this bring done already, or do I need to set up something myself, if it's possible? https://developers.google.com/maps/documentation/directions/overview
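If it helps, a logging script around that API is pretty small. This is a sketch: it assumes the JSON response shape described in the Directions API docs (`routes[0].legs[0].duration_in_traffic`, which only appears when `departure_time` is set) and a valid API key:

```python
import csv
import json
import time
import urllib.parse
import urllib.request

API_URL = "https://maps.googleapis.com/maps/api/directions/json"

def fetch_route(origin, destination, api_key):
    """Ask the Directions API for a route leaving now (traffic-aware)."""
    params = urllib.parse.urlencode({
        "origin": origin,
        "destination": destination,
        "departure_time": "now",  # required for duration_in_traffic
        "key": api_key,
    })
    with urllib.request.urlopen(f"{API_URL}?{params}") as resp:
        return json.load(resp)

def extract_seconds(response):
    """Pull the traffic-aware travel time (seconds) out of a response."""
    leg = response["routes"][0]["legs"][0]
    # Fall back to the plain duration if duration_in_traffic is absent.
    return leg.get("duration_in_traffic", leg["duration"])["value"]

def log_once(origin, destination, api_key, path="route_times.csv"):
    """Append one timestamped measurement to a CSV log."""
    seconds = extract_seconds(fetch_route(origin, destination, api_key))
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(
            [time.strftime("%Y-%m-%dT%H:%M:%S"), origin, destination, seconds]
        )
    return seconds
```

Run it from cron several times a day and you can compute your own mean and standard deviation from the CSV instead of trusting Google's single estimate.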
|
# ? Feb 21, 2022 03:17 |
|
Yup OK that's what I expected.
|
# ? Feb 21, 2022 16:23 |
|
Kong or Istio (or a different thing) for JWT claim-based routing (user A goes to Service green, user B goes to Service blue) - who ya got?
Junkiebev fucked around with this message at 01:46 on Feb 26, 2022 |
# ? Feb 26, 2022 01:43 |
|
There's a new general AI/ML thread for anyone who wants to see that.
|
# ? Mar 6, 2022 19:33 |
|
Can anyone recommend good projects to learn/apply enough knowledge to become proficient in:

- AWS/GCP (in whatever capacity companies seem to want me to be able to use it)
- Docker/Kubernetes
- Django/MongoDB

These seem to be some of the biggest/most frequently seen technologies in job postings, and I am tired of my current job's "we only make CRUD apps" philosophy getting in my way of moving on. Furthermore, if there are any other really big/useful frameworks anyone could recommend, I am all ears.
|
# ? Mar 11, 2022 21:16 |
|
Gin_Rummy posted:Can anyone recommend good projects to learn/apply enough knowledge to become proficient in: Pluralsight is expensive (try to get your company to pay for it, maybe) but I cannot recommend it highly enough for these specific technologies. I have learned so god drat much about them from their courses. That said, just diving in with AWS isn't *that* bad. Make a free account, start with a basic Lambda function hooked to an API Gateway, and you can be up and running in like 5 minutes doing fun things
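For a sense of scale, the Lambda half of that hello-world really is tiny. A sketch assuming the standard API Gateway proxy-integration event shape (the greeting payload itself is made up):

```python
import json

def handler(event, context):
    """Minimal Lambda handler behind an API Gateway proxy integration."""
    # The proxy integration passes query parameters in the event dict;
    # the key is present but None when there are no query params.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"hello": name}),
    }
```

Hook that up as the integration for a GET route and you have a working endpoint; everything after that is just growing the handler.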
|
# ? Mar 11, 2022 21:24 |
|
Gin_Rummy posted:Can anyone recommend good projects to learn/apply enough knowledge to become proficient in:

Yeah, here's a project that will hit all of those, be relatively simple to follow hello-world tutorials for, and is basically free.

Project: Make a serverless web app to display each item in some data set on a unique page.

Technology overview: Use the AWS SAM CLI to set up a Docker container with your Django project, deploy it to AWS Lambda, connect it to an AWS RDS db, and provide a RESTful API for it.

Steps:
- Using the AWS SAM CLI, set up a Docker container to deploy to AWS Lambda.
- Once you have your project, use Django + Django CMS to make a hello-world app and deploy it with your container.
- Expand your hello-world webpage so it displays rows in a database as custom pages. This can all be done from the backend; you don't need frontend calls or anything.
- Use Django REST Framework to provide an authenticated REST API to access the data.
- Make a nice readme with some images and a clean, pinned requirements.txt if you want this to be fodder for your GitHub.
|
# ? Mar 11, 2022 23:13 |
|
I kind of hate how something that incredibly obviously involves at least two different kinds of "server" is described as "serverless", even if I understand the point.
|
# ? Mar 12, 2022 02:30 |
|
Computer viking posted:I kind of hate how something that incredibly obviously involves at least two different kinds of "server" is described as "serverless", even if I understand the point.

Serverless obviously means "a non-persistent server". This wouldn't apply to RDS, of course, but I included RDS because I've not used Aurora, so I didn't want to say it was easy to set up. If he uses Aurora, though, it would actually be serverless.
|
# ? Mar 12, 2022 06:09 |
|
What's a good data structure for sorting/selecting a range of dates? I'm thinking a B-tree-like structure might be good to organize by level like year->month->day->hour->minute->second, then storing a flat list at each leaf node. This has got to be a well-worn problem in software, but I'm unsure of the best approach. My dataset is a bunch of files read from disk that may not be read in chronological order, but do have a timestamp attached to them. spiritual bypass fucked around with this message at 03:55 on Mar 13, 2022 |
# ? Mar 13, 2022 03:53 |
|
There’s probably not enough value in making sure that entries which share the same month or whatever fall within a common subtree to justify a custom data structure or the reduced locality that would come with what you’re talking about. Any ordered collection should let you efficiently look up an index for a key (e.g. 2012-04) and scan forward from that point.
|
# ? Mar 13, 2022 04:18 |
|
Yeah, my inclination is that you should use a generic sorted/ordered collection data structure, together with the standard DateTime type in your language. You could augment that with convenience methods that do things like "get me all the entries for month YYYY-MM" if you wanted.
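A sketch of that in Python: a plain sorted list of (timestamp, payload) pairs plus the stdlib bisect module gives logarithmic range lookups, with a "whole month" convenience on top (the class and method names are just illustrative):

```python
import bisect
from datetime import datetime

class TimestampIndex:
    """Files indexed by timestamp: a sorted list of (timestamp, payload) pairs."""

    def __init__(self):
        self._entries = []  # kept sorted by timestamp via insort

    def add(self, ts, payload):
        # Files may be read out of chronological order; insort keeps order.
        bisect.insort(self._entries, (ts, payload))

    def range(self, start, end):
        """Return payloads with start <= timestamp < end, in order."""
        # A 1-tuple sorts before any (equal-timestamp, payload) pair,
        # so bisect_left gives the first entry at or after the bound.
        lo = bisect.bisect_left(self._entries, (start,))
        hi = bisect.bisect_left(self._entries, (end,))
        return [payload for _, payload in self._entries[lo:hi]]

    def month(self, year, mon):
        """Convenience wrapper: everything in a given YYYY-MM."""
        if mon == 12:
            end = datetime(year + 1, 1, 1)
        else:
            end = datetime(year, mon + 1, 1)
        return self.range(datetime(year, mon, 1), end)
```

Insertion is O(n) worst case because of the list shift, but for a one-time load of files followed by lots of range queries that's usually fine - or just sort once after loading everything.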
|
# ? Mar 13, 2022 11:35 |