Jabor
Jul 16, 2010

#1 Loser at SpaceChem
Isn't this essentially a distributed union-find?

KillHour
Oct 28, 2007


Jabor posted:

Isn't this essentially a distributed union-find?

It might be! I'll look into that. Thanks. The briefest of Google searches seems to turn up a lot of references to grouping pixels/voxels together, so I'm hoping the established algorithms can handle nodes with lots of edges. Also, a lot of these papers are from the last couple of years, so it seems like this is a pretty active area of research?

Edit: From this paper

https://arxiv.org/pdf/2003.02351.pdf

quote:

2.2 Distributed and Parallel Visualization and Analysis
Distributed union-find is mostly used for connected component labeling (CCL) in scientific visualization and analysis algorithms, including distributed percolation analysis [7], interval volume extraction [5], statistical analysis of connected regions [6], contour tree computation [2, 4], and merge tree computation [3, 8]. Although the implementation of CCL [19] can be either union-find or breadth-first search [20–24], union-find is proved to be more scalable in distributed and parallel settings [13, 19] because union-find organizes elements of connected components using tree structures, explained in Section 3.1, which can efficiently synchronize labels of elements in the connected components across processes. The rest of this section samples typical scientific visualization [...]

Because our software is search-based and it's much simpler for me to get a list of all matches to a given document than it is to get a set of individual union operations, it seems like distributed breadth-first CCL might be something else worth looking into.

Double Edit: I haven't read enough yet to know if this is the case, but one of the things that immediately jumps out to me is that because the algorithms I can find focus on traversing images, they are likely optimized for deep but narrow trees. In practice, our application tends to be wide but shallow - it would be unusual for a cluster to take more than 2-3 iterations to find all connected values, but it would not be unusual for the cluster to have thousands of elements. Just thought I'd call that out, in case it's relevant.

The methods I'm seeing also tend to rely on a separation between local and global structures, where each thread works on a chunk of data and generates smaller trees that are then connected together in a second pass - which makes sense if you have a spatial dimension you can divide on, but I don't.

Triple edit oh my god: Okay, I think I'm starting to understand union-find a little better and it seems like it could be applicable to situations where a single union is more than a pair. My biggest concern is that I have no guarantee that two threads won't decide to make two identical but backwards unions at the same time (and in fact, it's very likely) because I can't partition them into discrete sets (any document could theoretically match any other document, and the matching algorithm returns all matches in the database).
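
For reference, here's a minimal single-threaded union-find sketch (path compression plus union by size). The relevant property is that union(A, B) and union(B, A) land in the same component either way, and repeating a union is a no-op, so duplicate or reversed unions are harmless as long as each one is applied atomically - the concurrency part is the separate hard problem.

Python code:

class UnionFind:
    """Minimal union-find with path compression and union by size."""

    def __init__(self):
        self.parent = {}  # element -> parent
        self.size = {}    # root -> size of its component

    def find(self, x):
        if x not in self.parent:       # lazily create singleton sets
            self.parent[x] = x
            self.size[x] = 1
            return x
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra                  # repeated/reversed unions are no-ops
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra            # union by size
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra


uf = UnionFind()
uf.union("A", "B")
uf.union("B", "A")                     # same union, backwards: no effect
uf.union("C", "A")
assert uf.find("B") == uf.find("C")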

KillHour fucked around with this message at 07:06 on Feb 4, 2022

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe

KillHour posted:

My company's software has a matching process that is really neat but I think the actual algorithm for iterating over everything isn't very efficient and there's probably a really clever way to do this better.

What is the actual problem statement here? You have a distributed database of (small) documents, and you want to take in a query and return a “superset” document that merges information from all the documents which partially match the query, with references to the original documents? Because it sounds like a trivially-distributed query with an inexpensive result-merging operation, and all of the extra complexity and expense is due to the effort to make it iterative.

KillHour
Oct 28, 2007


rjmccall posted:

What is the actual problem statement here? You have a distributed database of (small) documents, and you want to take in a query and return a “superset” document that merges information from all the documents which partially match the query, with references to the original documents? Because it sounds like a trivially-distributed query with an inexpensive result-merging operation, and all of the extra complexity and expense is due to the effort to make it iterative.

The problem statement is: I have a set of documents with lots of partial duplicates (names spelled wrong, different phone numbers, etc.), and I want to apply some kind of rule that says whether two documents are the "same" or not, across all combinations of documents, to end up with discrete sets of "mastered" entities. Think MDM. In certain circumstances, this might require multiple passes to ensure I get all matches. An example of this is if A would match B, A would NOT match C, and B would NOT match C, but the combination of A + B WOULD match C:

A:
Name: John Smyth
Address: 123 Fake Street
Phone: 555-555-5555
Email: jsmyth@company.com

B:
Name: John Smith
Address: 123 Fake Street
Phone: 555-555-5555
Email: johnsmyth@company.com

C:
Name: John Smith
Address: null
Phone: null
Email: jsmyth@company.com


So in this example, you need to merge A and B before you can realize that C should be merged as well, since C matches one thing from each of A and B but two things from the combination of them (assuming your heuristic is 2 fields or more match).

The thing that our database is really good at is returning search results. So it's very fast for me to say "give me the list of all documents that have propertyA = foo AND propertyB = bar AND propertyC != baz" or even "give me the list of all documents where propertyA is within a Levenshtein distance of 1 of 'foo'" (so boo and foq but not fuu), or lots of other heuristics. It's also fast to combine these operations together. Our database is not very good at "iterate over every document and do x to it", so anything that we can use the search functionality to do is preferred.
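
To make the multi-pass part concrete, here's a toy sketch of the fixed-point idea using the A/B/C example above, assuming the simple "two or more fields match" heuristic. The data and helper names are made up for illustration; the real version would run against the database rather than an in-memory dict.

Python code:

# Toy data mirroring the A/B/C example above (values are illustrative).
FIELDS = ["name", "address", "phone", "email"]

docs = {
    "A": {"name": "John Smyth", "address": "123 Fake Street",
          "phone": "555-555-5555", "email": "jsmyth@company.com"},
    "B": {"name": "John Smith", "address": "123 Fake Street",
          "phone": "555-555-5555", "email": "johnsmyth@company.com"},
    "C": {"name": "John Smith", "address": None,
          "phone": None, "email": "jsmyth@company.com"},
}


def field_values(cluster):
    """The 'superset' of non-null values each field has taken in a cluster."""
    merged = {f: set() for f in FIELDS}
    for doc_id in cluster:
        for f in FIELDS:
            if docs[doc_id][f] is not None:
                merged[f].add(docs[doc_id][f])
    return merged


def cluster_matches(c1, c2, threshold=2):
    """True if some doc in c2 agrees with c1's merged superset on >= threshold fields."""
    merged = field_values(c1)
    return any(
        sum(1 for f in FIELDS
            if docs[d][f] is not None and docs[d][f] in merged[f]) >= threshold
        for d in c2
    )


# Start with singleton clusters and merge until nothing changes.
clusters = [{d} for d in docs]
changed = True
while changed:
    changed = False
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if clusters[i] and clusters[j] and (
                cluster_matches(clusters[i], clusters[j])
                or cluster_matches(clusters[j], clusters[i])
            ):
                clusters[i] |= clusters[j]
                clusters[j] = set()
                changed = True
    clusters = [c for c in clusters if c]

print(clusters)  # [{'A', 'B', 'C'}] - C only joins once A and B have merged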

leper khan
Dec 28, 2010
Honest to god thinks Half Life 2 is a bad game. But at least he likes Monster Hunter.

KillHour posted:

The problem statement is: I have a set of documents with lots of partial duplicates (names spelled wrong, different phone numbers, etc.), and I want to apply some kind of rule that says whether two documents are the "same" or not, across all combinations of documents, to end up with discrete sets of "mastered" entities. Think MDM. In certain circumstances, this might require multiple passes to ensure I get all matches. An example of this is if A would match B, A would NOT match C, and B would NOT match C, but the combination of A + B WOULD match C:

A:
Name: John Smyth
Address: 123 Fake Street
Phone: 555-555-5555
Email: jsmyth@company.com

B:
Name: John Smith
Address: 123 Fake Street
Phone: 555-555-5555
Email: johnsmyth@company.com

C:
Name: John Smith
Address: null
Phone: null
Email: jsmyth@company.com


So in this example, you need to merge A and B before you can realize that C should be merged as well, since C matches one thing from each of A and B but two things from the combination of them (assuming your heuristic is 2 fields or more match).

The thing that our database is really good at is returning search results. So it's very fast for me to say "give me the list of all documents that have propertyA = foo AND propertyB = bar AND propertyC != baz" or even "give me the list of all documents where propertyA is within a Levenshtein distance of 1 of 'foo'" (so boo and foq but not fuu), or lots of other heuristics. It's also fast to combine these operations together. Our database is not very good at "iterate over every document and do x to it", so anything that we can use the search functionality to do is preferred.

If you have a fixed set of fields and your records maintain an absolute order, you could create an index of the number of matched fields per record. If your number of fields is small, you could use a bit set to track which fields matched. If you use a bit set, then merging results is just a bitwise OR in your index.
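
A rough sketch of that, assuming a small fixed field list (field names and records here are just borrowed from the example above):

Python code:

# One bit per field; a set bit means the two records agree on that field.
FIELDS = ["name", "address", "phone", "email"]
FIELD_BIT = {f: 1 << i for i, f in enumerate(FIELDS)}


def match_mask(candidate, reference):
    """Bitmask of fields on which two records agree (nulls never match)."""
    mask = 0
    for f in FIELDS:
        if candidate.get(f) is not None and candidate.get(f) == reference.get(f):
            mask |= FIELD_BIT[f]
    return mask


def combine(mask_a, mask_b):
    """Merging results from two passes/sources is just a bitwise OR."""
    return mask_a | mask_b


def matched_field_count(mask):
    return bin(mask).count("1")  # popcount


a = {"name": "John Smyth", "address": "123 Fake Street",
     "phone": "555-555-5555", "email": "jsmyth@company.com"}
b = {"name": "John Smith", "address": "123 Fake Street",
     "phone": "555-555-5555", "email": "johnsmyth@company.com"}

print(matched_field_count(match_mask(a, b)))  # 2 -> address and phone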

KillHour
Oct 28, 2007


Unfortunately, the "any two fields match" is just an easy example. Different fields could have different weights (even negative ones), and thresholds could be arbitrary. In addition, you can have multiple actions with different thresholds.

So to make the example slightly more complicated, you could say name is worth 10 points, phone number, email and address are each worth 5 points, and you need a total of 15 points to automatically merge or 10 points to go into a queue for manual review.

The number of fields and fields themselves are also arbitrary. We're not trying to solve this for a single particular situation and a single set of documents - this is a framework designed for any possible set of matching criteria on arbitrary documents.
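
As a sketch of what the configurable version looks like (the fields, weights and thresholds below are just the example above; the rule definition is data rather than code, so any of it can vary per deployment):

Python code:

RULES = {
    "weights": {"name": 10, "phone": 5, "email": 5, "address": 5},
    "thresholds": {"auto_merge": 15, "manual_review": 10},
}


def score(candidate, reference, rules):
    """Sum the weights of the fields on which the two records agree.

    Weights can be negative to penalize a match on a given field.
    """
    total = 0
    for field, weight in rules["weights"].items():
        a, b = candidate.get(field), reference.get(field)
        if a is not None and a == b:
            total += weight
    return total


def action(candidate, reference, rules):
    s = score(candidate, reference, rules)
    if s >= rules["thresholds"]["auto_merge"]:
        return "merge"
    if s >= rules["thresholds"]["manual_review"]:
        return "review"
    return "no_match"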

Edit: wouldn't you need that bit set for each pair of records, not each individual record? There could easily be tens of millions of records.

KillHour fucked around with this message at 15:41 on Feb 4, 2022

HappyHippo
Nov 19, 2003
Do you have an Air Miles Card?

KillHour posted:


Triple edit oh my god: Okay, I think I'm starting to understand union-find a little better and it seems like it could be applicable to situations where a single union is more than a pair. My biggest concern is that I have no guarantee that two threads won't decide to make two identical but backwards unions at the same time (and in fact, it's very likely) because I can't partition them into discrete sets (any document could theoretically match any other document, and the matching algorithm returns all matches in the database).

So at work recently I've been implementing this thing called wait-free union find. I'm not sure if it specifically works for your case but it might be worth looking into

KillHour
Oct 28, 2007


HappyHippo posted:

So at work recently I've been implementing this thing called wait-free union find. I'm not sure if it specifically works for your case but it might be worth looking into

Thanks for the suggestion. Based on this paper - https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.8354&rep=rep1&type=pdf - it seems to assume there's a shared piece of memory that all threads can access to update their results. Because this is a transactional database, everything is inherently stateless and a thread only has access to itself. The only way to access data outside is to search (fast), read (kind of slow), or write (very slow) documents to the database. In addition, if I want to write a document without terminating the current thread, I need to create a new transaction in its own context, passing the data to be written to that transaction as a calling parameter, and then terminate that thread. In the current implementation, the thread / batch only ever writes out its state once at the very end.

What I really need is a completely stateless algorithm where one thread never needs to know what another is doing, other than being able to check the results of already completed and terminated threads. In addition, threads can spawn in any order and might have any arbitrary subset of starting documents.

Oh, and because of the way we handle scaling, it's possible for a server running a bunch of threads to just disappear mid transaction, so everything has to be guaranteed ACID so we can just resubmit that batch and be confident it's idempotent and not leaving garbage in the database. That is given to be true if it's done in a single transaction (since writes are committed after the transaction terminates), but that precludes spawning off transactions to write before the entire thing is finished.

....yeah.

Edit: Sorry if it sounds like I'm being obtuse. I promise the system can actually work really well if you can design your algorithm in discrete steps such that a given step can run in parallel in a single transaction without writing data until the end. What we can do is take the output of a step in aggregate and use that as the input to the next step, but you want to limit that as much as possible since writing to disk is slow.

Double Edit: To make it more explicit how this is all orchestrated: we have a process that runs an arbitrary search on the database for some set of documents. Those documents are then split into batches (usually between one and a few hundred documents each), and for each batch, a V8 runtime is spun up in its own isolated context. I can order these documents by requesting the search results be ordered by their position in an index (so I could have all documents sorted alphabetically by last name, for instance), or I can leave them in "default" order, which is deterministic but pseudorandom. What I can't do is split the batches differently - they are always the same size (except the last batch), so it would be very difficult or impossible to ensure two given documents always end up in the same batch.

This V8 context has access to the documents passed into it, as well as a set of APIs for searching, retrieving and writing documents to the database. Searching and retrieving documents happens immediately (assuming the document isn't locked by another thread), but writing only happens after the context is terminated and the transaction is complete. You can also spawn other transactions, but that's an antipattern for the reasons above (although sometimes it's unavoidable - when reading from two different databases, for instance). If you need to, you can let all the threads/batches complete and then run a new search on the database (likely for the output of the first step) and use those results as input to the next step, which repeats the process.

I think at this point I'm rubber ducking, but it's nice to write it all down anyways.

KillHour fucked around with this message at 04:41 on Feb 5, 2022

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe
Counter-intuitively, I think it’s at least possible that you would scale better if you were less worried about doing unnecessary work. Like, you could fetch all the documents that match your starting document for any field, without trying to make the database query enforce the rule about needing a match in multiple fields, and then iteratively (1) locally figure out which of the fetched documents not already in your equivalence class now belong in it, (2) figure out what field values you’ve just added for the first time to your superset document, (3) do a query asking for matches for any of those, and (4) go back to step 1, adding any new documents you found. You’ll do some redundant I/O, but your queries will be smaller and simpler to execute, and you won’t have to build a bunch of temporary tables.

And of course you can distribute this, so that you get all the transitive matches available locally on any node. It won’t stop you from having to do iterations, but you’ll probably converge a lot faster. And you can report partial matches back in your intermediate results, which would probably be useful in avoiding unnecessary iterations.
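
A sketch of that loop, where search() stands in for a hypothetical "match any of these field values" database call and belongs() is whatever the real matching rule is (both names are made up here):

Python code:

def build_equivalence_class(seed_doc, search, belongs, fields):
    """Grow one equivalence class to a fixed point.

    search(frontier)       -> hypothetical DB call returning docs that match
                              ANY of the given {field: set_of_values} pairs.
    belongs(doc, superset) -> the real matching rule, applied locally.
    """
    cluster = {seed_doc["id"]: seed_doc}
    superset = {f: {seed_doc[f]} for f in fields if seed_doc.get(f) is not None}
    frontier = {f: set(v) for f, v in superset.items()}  # values not yet queried

    while frontier:
        candidates = search(frontier)   # one broad query per iteration
        frontier = {}
        for doc in candidates:
            if doc["id"] in cluster or not belongs(doc, superset):
                continue
            cluster[doc["id"]] = doc
            # Only field values seen for the first time go back to the DB.
            for f in fields:
                v = doc.get(f)
                if v is not None and v not in superset.setdefault(f, set()):
                    superset[f].add(v)
                    frontier.setdefault(f, set()).add(v)
    return cluster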

KillHour
Oct 28, 2007


rjmccall posted:

and you won’t have to build a bunch of temporary tables.

Not sure if this factors into what you're suggesting (since it's late and my brain isn't processing well), but we're on a NoSQL database, so there aren't any temporary tables (or tables at all). Data returned from a query is in JSON format and maps cleanly to a JS object when accessed (but is typically resolved lazily so you don't accidentally try to dump a huge query out from disk all at once). Complex nested queries are actually generally faster to do directly in the database than in the process itself. They work directly on the in-memory indexes on dedicated hardware and are implemented in C++ instead of running inside V8. As a general rule, any time I can use a native query to do some work for me instead of doing it in the context of the step, it's likely going to be faster.

As an example, I ran a query that said "give me the contents of all documents with this name" on a pretty small dev instance. It took about 14ms to return the contents of 508 documents. Then I said "give me the IDs of the documents that match at least two of the 3 following criteria {name, email, phone}" and it took 2.5ms to return the 162 IDs of the specific set of documents I wanted, purely because everything happened in memory and I didn't need to wait for the disks to retrieve the actual JSON data (I disabled result caching for the test).

The documents themselves aren't huge, but they aren't small either. They have a lot of fields I'm uninterested in for the purpose of mastering. I could get clever and have the DB only return the portions of the documents I care about (since it's a JSON tree I can use xpath for that), but it's probably still faster to go directly off of the indexes whenever possible.

One thing that might be worth looking into is that the hardware that performs queries is physically separate from the hardware that performs the batch jobs, and the former doesn't dynamically scale (because it's tied to the actual drives) while the latter can dynamically scale quite large. So if I can keep the I/O down, it might be worth trying to spread the work out a bit more. Definitely wouldn't want to start pulling entire documents for comparisons, though, since that would crush my I/O.

KillHour fucked around with this message at 06:46 on Feb 5, 2022

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe
Alright, fair enough. You’d said something about checking in the database that you hadn’t already merged a particular document, which sounds like a temporary object of some sort; that’s all I was thinking.

KillHour
Oct 28, 2007


rjmccall posted:

Alright, fair enough. You’d said something about checking in the database that you hadn’t already merged a particular document, which sounds like a temporary object of some sort; that’s all I was thinking.

Oh yeah, since the threads are all starting and completing at different times (there could be a hundred thousand batches but you only run 100 threads at a time, for instance), the ones that are complete have already written their results out. Since data is indexed as soon as it's written, it's fast to do a quick check of "is there any document in the 'merge' collection that has this document ID in it already?" Since the index is sorted, it's basically a binary search to tell whether it exists or not. If it exists, we can skip it. It is a temporary document, but we need that document to actually complete the merge pass, so it's overall pretty efficient and saves a bunch of work.

Edit: Thanks for all the suggestions, seriously. It's helping me think more fully about what the system is actually doing instead of just the parts I'm inclined to focus on.

KillHour fucked around with this message at 07:14 on Feb 5, 2022

Woebin
Feb 6, 2006

I'm struggling with something that I think might just be a bad approach to begin with. Basically, I'm on a .NET + JS project, mainly doing frontend stuff for a webshop, and we send requests to / get responses from a backend that's maintained separately by a different team. When there are user-facing errors, we show a generic message while the real fault info (e.g. "user is flagged as unwanted customer" or whatever) is hidden. Now there's a desire from testers to see the real fault info in that error message when in test environments.

So! The original fault info is now included as a header in the WCF response we get from said backend, and I can't for the life of me figure out how to propagate that to the frontend. I can intercept the message in an IClientMessageInspector and verify the header is present, but once it moves beyond that point all that remains is what's in the message body.

My gut feeling is "if it's meant to be available on the frontend it goes in the body", but that's considered a security issue since then anyone could see that the fault codes are being masked, even if the original fault data isn't present at this point in production.

I'm probably phrasing this all as a confusing mess, possibly because it feels entirely like a confusing mess to me. Is there a good way to do this? Is the approach of sending this data in a custom header reasonable in the first place?

I might just be dumb as hell.

Edit for tl;dr: I'm trying to get a custom header and its value from a WCF response and display it on a webpage, and it seems borderline impossible.

Woebin fucked around with this message at 13:35 on Feb 7, 2022

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS
Log the actual error message (making sure to sanitize any sensitive info) and give testers the ability to access those logs. Don’t send that stuff back in any part of the response if you don’t want all users to be able to see it.

Woebin
Feb 6, 2006

Blinkz0rz posted:

Log the actual error message (making sure to sanitize any sensitive info) and give testers the ability to access those logs. Don’t send that stuff back in any part of the response if you don’t want all users to be able to see it.
It's already getting logged, decision-makers just want it right there on the page. The backend only sends this stuff when it's in a test environment so it won't be visible to regular users, but I still think you're right.

Jabor
Jul 16, 2010

#1 Loser at SpaceChem
By "frontend" do you mean client-side JS? Or do you mean the frontend web server that's turning your backend responses into something suitable to send to user's browsers?

If there's something you want to keep secret, it shouldn't be sent to client-side JS at all. (Not even in a header!) You should be checking on your web server whether it's appropriate to show full debug information, and only if it is should you send that data down to the client.

What's in the message body you're getting from your backends? Surely it's some sort of structured message and not just raw HTML that goes down to clients unchanged?

Woebin
Feb 6, 2006

Jabor posted:

By "frontend" do you mean client-side JS? Or do you mean the frontend web server that's turning your backend responses into something suitable to send to user's browsers?

If there's something you want to keep secret, it shouldn't be sent to client-side JS at all. (Not even in a header!) You should be checking on your web server whether it's appropriate to show full debug information, and only if it is should you send that data down to the client.

What's in the message body you're getting from your backends? Surely it's some sort of structured message and not just raw HTML that goes down to clients unchanged?
Yeah, been a dev for five years and still don't know how to phrase this stuff well, sorry! It's more like the external backend sends us a response, our own C# bits receive it server-side as a SOAP envelope with a raw XML body and some headers, then somewhere in the chain that I can't find this gets parsed and passed on to the client as basically just the body contents but now in JSON. So I can look at the response as we get it in the C# part of things, but can't take what I find there and add it to what reaches the client (currently what the client gets is the masked fault info, consisting of a generic error code and message saying to call customer service; this is also in the body of the original response).

Jabor
Jul 16, 2010

#1 Loser at SpaceChem
So the "right" answer here is that the external backend shouldn't have that much control over what your frontend sends down to clients. Your frontend should have responsibility for its own JSON, and the backend should be able to add information as it pleases to your internal messaging without that affecting what gets sent down to clients.

Actually getting to that point from your current situation sounds like it might be a challenge though.

Figuring out where exactly your frontend turns WCF Messages into JSON is probably the best next step - it's at that point you could look at the Properties on the message and either include or not include the extra debugging data in the JSON you emit (depending on whether it's a test server or not).
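
Framework specifics aside, the shape of the check at that conversion point is roughly this (a Python sketch purely to show the gating - not the WCF API, and the environment-variable name is made up):

Python code:

import os


def build_error_payload(masked_code, masked_message, real_fault=None):
    """Build the error JSON the client sees; real fault info only in test."""
    payload = {"code": masked_code, "message": masked_message}
    # Production builds never include the real fault, so there's nothing
    # for a client to discover there.
    if os.environ.get("APP_ENV") == "test" and real_fault is not None:
        payload["debugFault"] = real_fault
    return payload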

Woebin
Feb 6, 2006

Jabor posted:

So the "right" answer here is that the external backend shouldn't have that much control over what your frontend sends down to clients. Your frontend should have responsibility for its own JSON, and the backend should be able to add information as it pleases to your internal messaging without that affecting what gets sent down to clients.

Actually getting to that point from your current situation sounds like it might be a challenge though.

Figuring out where exactly your frontend turns WCF Messages into JSON is probably the best next step - it's at that point you could look at the Properties on the message and either include or not include the extra debugging data in the JSON you emit (depending on whether it's a test server or not).
We've got two sister teams working on their own thing, there's my team and the team that builds the "external" backend. Formally we're considered part of the same team but in reality they don't touch or understand our code and we don't touch or understand theirs. The whole setup is honestly kind of a nightmare IMO.

But yeah, I think your point holds and is also where I'm at - I need to figure out where we break down the original response message.

Or there's the other option - convince the testers and business people that this is a bad idea and they should just learn to check the logs.

Days like this really make me feel like I'm terrible at my job.

KillHour
Oct 28, 2007


Enjoy the poo poo storm that happens when someone forgets to check a box and the test version goes into prod.

Woebin
Feb 6, 2006

KillHour posted:

Enjoy the poo poo storm that happens when someone forgets to check a box and the test version goes into prod.
More likely there's a

Java code:

if ("test environment url".equals(requestOrigin)) {
    sendOriginalFaultData();
} else {
    // don't
}
somewhere in the backend code, with a raw string. Which I'm sure is all fine.

KillHour
Oct 28, 2007


I give people too much credit

teen phone cutie
Jun 18, 2012

last year i rewrote something awful from scratch because i hate myself
i'm being tasked with learning Flutter for work. All expenses paid.

What are the best resources for learning Flutter, not including a whole-rear end bootcamp?

Volmarias
Dec 31, 2002

EMAIL... THE INTERNET... SEARCH ENGINES...

teen phone cutie posted:

i'm being tasked with learning Flutter for work. All expenses paid.

What are the best resources for learning Flutter, not including a whole-rear end bootcamp?

If you don't know Dart yet, start there.

America Inc.
Nov 22, 2013

I plan to live forever, of course, but barring that I'd settle for a couple thousand years. Even 500 would be pretty nice.
This is more of a networking question, maybe there isn't going to be much actual programming involved.

I live in the US but I sometimes have to use NordVPN to switch to a Spanish IP so I can, for example, watch Spanish movies in the original language on Netflix. I have multiple devices including a TV connected to a router, and I want to be able to change location on a dime so I can watch Spanish Netflix on my TV.

I have a raspberry pi which I have used before for Pi-Hole, and I suspect I could do something similar for this where I connect the Pi by NordVPN and route all traffic through it. Does that make sense? It's the next best thing to just loving moving to Spain

spiritual bypass
Feb 19, 2008

Grimey Drawer
Some routers can do a VPN for the whole network if that's what you want. Taking the Raspberry Pi approach would mean essentially making your own router. Some people do this, but it's complex and not worth it for most people.

yippee cahier
Mar 28, 2005

Coincidentally, I am just bringing up a new OpenWrt router that supports Pi-hole-style DNS adblocking and can run a VPN client. Quite happy with the ease of use and the flexibility so far. I don't think it'd be too difficult to script up a toggle switch to change your VPN location, for instance. There's definitely a Pi image available.

Spikes32
Jul 25, 2013

Happy trees
Dumb question incoming. I want to check Google Maps times for preset routes several times a day and log the times. I don't want to trust the 'estimated times' that Google gives when you select your departure time because there's no report of standard deviation. I've tried googling this, but just get stuck in Google Maps how-to pages. Has anyone heard of this being done already, or do I need to set up something myself, if it's possible?

Jabor
Jul 16, 2010

#1 Loser at SpaceChem

Spikes32 posted:

Dumb question incoming. I want to check Google Maps times for preset routes several times a day and log the times. I don't want to trust the 'estimated times' that Google gives when you select your departure time because there's no report of standard deviation. I've tried googling this, but just get stuck in Google Maps how-to pages. Has anyone heard of this being done already, or do I need to set up something myself, if it's possible?

https://developers.google.com/maps/documentation/directions/overview
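
If you want to roll your own logger on top of that, a polling sketch looks roughly like this (Python with requests; the key and addresses are placeholders, and check the docs for the exact response fields - duration_in_traffic only shows up when you pass a departure_time and may depend on your plan):

Python code:

import csv
from datetime import datetime

import requests

API_KEY = "YOUR_KEY"                      # hypothetical placeholder
ROUTES = [("Home Address, City", "Work Address, City")]


def fetch_duration_seconds(origin, destination):
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/directions/json",
        params={"origin": origin, "destination": destination,
                "departure_time": "now", "key": API_KEY},
        timeout=30,
    )
    leg = resp.json()["routes"][0]["legs"][0]
    # Fall back to the plain estimate if traffic data isn't returned.
    return leg.get("duration_in_traffic", leg["duration"])["value"]


# Run this from cron/Task Scheduler a few times a day and append to a CSV.
with open("travel_times.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for origin, dest in ROUTES:
        writer.writerow([datetime.now().isoformat(), origin, dest,
                         fetch_duration_seconds(origin, dest)])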

Spikes32
Jul 25, 2013

Happy trees
Yup OK that's what I expected.

Junkiebev
Jan 18, 2002


Feel the progress.

Kong or Istio (or a different thing) for JWT claim-based routing (user A goes to Service green, user B goes to Service blue) - who ya got?

Junkiebev fucked around with this message at 01:46 on Feb 26, 2022

ultrafilter
Aug 23, 2007

It's okay if you have any questions.


There's a new general AI/ML thread for anyone who wants to see that.

Gin_Rummy
Aug 4, 2007
Can anyone recommend good projects to learn/apply enough knowledge to become proficient in:

- AWS/GCP (in whatever capacity companies seem to want me to be able to use it for)
- Docker/Kubernetes
- Django/MongoDB

These seem to be some of the bigger/most frequently seen techs in job postings, and I am tired of my current job's "we only make CRUD apps" philosophy getting in the way of moving on.

Furthermore, if there are any other really big/useful frameworks anyone could recommend, I am all ears.

camoseven
Dec 30, 2005

RODOLPHONE RINGIN'

Gin_Rummy posted:

Can anyone recommend good projects to learn/apply enough knowledge to become proficient in:

- AWS/GCP (in whatever capacity companies seem to want me to be able to use it for)
- Docker/Kubernetes
- Django/MongoDB

These seem to be some of the bigger/most frequently seen techs in job postings, and I am tired of my current job's "we only make CRUD apps" philosophy getting in the way of moving on.

Furthermore, if there are any other really big/useful frameworks anyone could recommend, I am all ears.

Pluralsight is expensive (try to get your company to pay for it, maybe) but I cannot recommend it highly enough for these specific technologies. I have learned so god drat much about them from their courses.

That said, just diving in with AWS isn't *that* bad. Make a free account, start with a basic Lambda function hooked to an API Gateway, and you can be up and running in like 5 minutes doing fun things
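
For example, the Lambda side of that can be as small as this (a Python handler behind an API Gateway proxy integration; everything beyond the standard handler signature is just an illustrative choice):

Python code:

import json


def lambda_handler(event, context):
    """Minimal handler for an API Gateway (proxy integration) -> Lambda setup.

    API Gateway passes the HTTP request in `event` and expects a dict with
    statusCode/headers/body back; the body must be a string.
    """
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello {name}"}),
    }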

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Gin_Rummy posted:

Can anyone recommend good projects to learn/apply enough knowledge to become proficient in:

- AWS/GCP (in whatever capacity companies seem to want me to be able to use it for)
- Docker/Kubernetes
- Django/MongoDB

These seem to be some of the bigger/most frequently seen techs in job postings, and I am tired of my current job's "we only make CRUD apps" philosophy getting in the way of moving on.

Furthermore, if there are any other really big/useful frameworks anyone could recommend, I am all ears.

Yeah, here's a project that will hit all of those, is relatively simple to follow hello-world tutorials for, and is basically free.

Project: Make a serverless web app to display each item in some data set on a unique page.

Technology overview: Use the AWS SAM CLI to set up a Docker container with your Django project that you will deploy to AWS Lambda, connect to an AWS RDS database, and provide a RESTful API for.

Steps:
  • Using the AWS SAM CLI, set up a Docker container to deploy to AWS Lambda.
  • Once you have your project, use Django + Django CMS to make a hello world app and deploy it with your container.
  • Expand your hello world webpage so that it displays rows from a database as a custom page. This can all be done from the backend; you don't need frontend calls or anything.
  • Use Django REST Framework to provide an authenticated REST API to access the data (see the sketch below).

Make a nice readme with some images and a clean, pinned requirements.txt if you want this to be fodder for your GitHub
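
For the REST step, the DRF part is pretty small once the Django app exists - roughly this shape (the model and field names are made up; this lives inside a Django app with rest_framework in INSTALLED_APPS):

Python code:

from django.db import models
from rest_framework import permissions, routers, serializers, viewsets


class Item(models.Model):
    name = models.CharField(max_length=200)
    description = models.TextField(blank=True)


class ItemSerializer(serializers.ModelSerializer):
    class Meta:
        model = Item
        fields = ["id", "name", "description"]


class ItemViewSet(viewsets.ReadOnlyModelViewSet):
    queryset = Item.objects.all()
    serializer_class = ItemSerializer
    permission_classes = [permissions.IsAuthenticated]  # the "authenticated" part


router = routers.DefaultRouter()
router.register(r"items", ItemViewSet)
# urls.py would then include:  path("api/", include(router.urls))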

Computer viking
May 30, 2011
Now with less breakage.

I kind of hate how something that incredibly obviously involves at least two different kinds of "server" is described as "serverless", even if I understand the point.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Computer viking posted:

I kind of hate how something that incredibly obviously involves at least two different kinds of "server" is described as "serverless", even if I understand the point.

Serverless obviously means "a non-persistent server". This wouldn't apply to RDS, of course, but I included RDS because I've not used Aurora, so I didn't want to say it was easy to set up. If he uses Aurora, though, it would actually be serverless.

spiritual bypass
Feb 19, 2008

Grimey Drawer
What's a good data structure for sorting/selecting a range of dates? I'm thinking a B-tree-like structure might be good to organize by level like year->month->day->hour->minute->second, then storing a flat list at each leaf node. This has got to be a well-worn problem in software, but I'm unsure of the best approach.

My dataset is a bunch of files read from disk that may not be read in chronological order, but do have a timestamp attached to them.

spiritual bypass fucked around with this message at 03:55 on Mar 13, 2022

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe
There’s probably not enough value in making sure that entries which share the same month or whatever fall within a common subtree to justify a custom data structure or the reduced locality that would come with what you’re talking about. Any ordered collection should let you efficiently look up an index for a key (e.g. 2012-04) and scan forward from that point.

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
Yeah, my inclination is that you should use a generic sorted/ordered collection data structure, together with the standard DateTime type in your language. You could augment that with convenience methods that do things like "get me all the entries for month YYYY-MM" if you wanted.
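
Something like this is usually enough (Python's bisect over a sorted list of (timestamp, item) pairs, as a stand-in for whatever ordered collection your language has):

Python code:

import bisect
from datetime import datetime

# Keep (timestamp, filename) pairs sorted; binary-search out a date range.
entries = []


def add(ts, filename):
    bisect.insort(entries, (ts, filename))


def in_range(start, end):
    """All entries with start <= timestamp < end."""
    lo = bisect.bisect_left(entries, (start,))
    hi = bisect.bisect_left(entries, (end,))
    return entries[lo:hi]


add(datetime(2012, 4, 3, 12, 30), "a.log")
add(datetime(2012, 3, 28, 9, 0), "b.log")
add(datetime(2012, 5, 1, 0, 0), "c.log")
print(in_range(datetime(2012, 4, 1), datetime(2012, 5, 1)))  # just the a.log entry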
