Register a SA Forums Account here!
JOINING THE SA FORUMS WILL REMOVE THIS BIG AD, THE ANNOYING UNDERLINED ADS, AND STUPID INTERSTITIAL ADS!!!

You can: log in, read the tech support FAQ, or request your lost password. This dumb message (and those ads) will appear on every screen until you register! Get rid of this crap by registering your own SA Forums Account and joining roughly 150,000 Goons, for the one-time price of $9.95! We charge money because it costs us money per month for bills, and since we don't believe in showing ads to our users, we try to make the money back through forum registrations.
 
  • Post
  • Reply
Kreeblah
May 17, 2004

INSERT QUACK TO CONTINUE


Taco Defender

ExcessBLarg! posted:

A shell script that packs the filename into a larger string and calls eval on it.

I mean, that's fair. And I did fix that earlier today (your comments on how to more safely sanitize stuff in bash were super helpful, since it turns out that zsh can use the same trick, so thanks for pointing that out).

The response I got on the bug I filed with the MAME folks, on the other hand, was not super heartening, though, given that it started with "This isn’t a bug per se, it’s just the result of a poorly implemented feature."

Kuule hain nussivan posted:

Good thing you figured it out! Sorry I couldn't be of more help.

I appreciate the ideas regardless. The various shell scripting languages aren't really ones I'm all that familiar with, so any feedback somebody's willing to give me is welcome.

Adbot
ADBOT LOVES YOU

Tiny Myers
Jul 29, 2021

say hello to my little friend


I'm a baby programmer who cut their teeth on Python. I love Python and I enjoyed using PyGame, but the problem of course is that if anyone wants to play my dumb little games they have to download and run an EXE which isn't always going to reach as big an audience as something in the browser. So I'm kind of trying to figure out what to transition to.

What is the best language or engine for me to learn to make games for the browser? I don't need a lot of bells and whistles, I just want to make simple 2D indie games by myself that I can put on itch.io and share around. Stuff that looks like Papers, Please or Baba Is You, simple pixel graphics (I'm an artist first, I can handle the art part) and interface and doesn't have physics.

bigperm
Jul 10, 2001
some obscure reference
I don't have any experience with it but if you are just making games for the web, there are a ton of JavaScript game engines.

https://github.com/collections/javascript-game-engines

They look pretty cool and you could wrap it up in an electron app to distribute.

leper khan
Dec 28, 2010
Honest to god thinks Half Life 2 is a bad game. But at least he likes Monster Hunter.

Tiny Myers posted:

I'm a baby programmer who cut their teeth on Python. I love Python and I enjoyed using PyGame, but the problem of course is that if anyone wants to play my dumb little games they have to download and run an EXE which isn't always going to reach as big an audience as something in the browser. So I'm kind of trying to figure out what to transition to.

What is the best language or engine for me to learn to make games for the browser? I don't need a lot of bells and whistles, I just want to make simple 2D indie games by myself that I can put on itch.io and share around. Stuff that looks like Papers, Please or Baba Is You, simple pixel graphics (I'm an artist first, I can handle the art part) and interface and doesn't have physics.

Unity's web player is a little lovely but works fine for this.

Whoreson Welles
Mar 4, 2015

ON TO THE NEXT PAGE!
I’m a Python amateur and wanted to start dipping my toes into HMTL, javascript, and goofing off in web design.

Eventually I’d like to purchase a cheap domain to screw around on and show off to my friends. Does anyone have recommendations on where to register/host that will allow me to build from scratch without trying to upsell me a million things or make me use a handholdey site builder like Squarespace or Godaddy?

leper khan
Dec 28, 2010
Honest to god thinks Half Life 2 is a bad game. But at least he likes Monster Hunter.

Whoreson Welles posted:

I’m a Python amateur and wanted to start dipping my toes into HMTL, javascript, and goofing off in web design.

Eventually I’d like to purchase a cheap domain to screw around on and show off to my friends. Does anyone have recommendations on where to register/host that will allow me to build from scratch without trying to upsell me a million things or make me use a handholdey site builder like Squarespace or Godaddy?

I use namecheap as a registrar. They're fine. I run some services from my home network. They have a relatively cheap shared hosted offering if that's what you're looking for.

There's a goon affiliate for lithium hosting I think.

Armauk
Jun 23, 2021


Whoreson Welles posted:

Does anyone have recommendations on where to register/host that will allow me to build from scratch without trying to upsell me a million things or make me use a handholdey site builder like Squarespace or Godaddy?

Domain registrars: Namecheap, Name, Google

Hosts: Vultr, https://lowendbox.com/

Obfuscation
Jan 1, 2008
Good luck to you, I know you believe in hell
For hosting random projects I would recommend Github Pages, it's very easy to use and completely free. For domains, I use Namecheap but I honestly have no idea if there's something better out there.

leper khan
Dec 28, 2010
Honest to god thinks Half Life 2 is a bad game. But at least he likes Monster Hunter.
Speaking of domains. Googles free webmail for grandfathered domains is ending in the very near term. What's the best cheapest option for hosting mail. If this was 1999 I'd manage it myself. But I'd like to both receive email and have others receive my email. And hosting it myself won't get me those outcomes today.

Pollyanna
Mar 5, 2005

Milk's on them.


Is it worth it to break up a package with an annoying-to-set-up-tests-for public API for the sole purpose of making testing easier?

We have a self-contained collection of code (call it a module, class, package, applies the same either way). We’re branching some logic deep in the call stack to one of our public methods, where the change itself is rather simple and just passes down some data from further up the stack to make a decision later.

The problem is that we need to input a large, complex state to the public method’s tests for each case we want to test, and this state will have a lot of data that’s orthogonal to the change we’re making but is still necessary to get any output at all. We’ve been able to do this a couple times, but now I gotta build a big ol’ graph for this change and I’m getting loving sick of it.

I’ve proposed unit testing the private functions directly, but that’s discouraged - we follow the best practice of “test code using real APIs - don’t define back doors only for tests”, and unit testing private functions violates that. An alternative I’ve explored is making it simple and easy to set up the large state we want to test against, but that only resolves so much pain.

So, now I’m considering a different solution - just make packages around the logic I want to test. But I have no idea if that actually solves the problem. Would it be worth it to split the package up into smaller ones, and retain a larger “composition” package that uses public APIs of the smaller packages? Or does that violate the spirit of the best practice?

KillHour
Oct 28, 2007


leper khan posted:

Unity's web player is a little lovely but works fine for this.

I really like unity's web player because it lets me cut a prototype release, upload it to twitch, and still have the ability to do a "real" release after.

There are lots of other reasons to not use Unity though - least of which is the learning curve.

pokeyman
Nov 26, 2006

That elephant ate my entire platoon.

leper khan posted:

Speaking of domains. Googles free webmail for grandfathered domains is ending in the very near term. What's the best cheapest option for hosting mail. If this was 1999 I'd manage it myself. But I'd like to both receive email and have others receive my email. And hosting it myself won't get me those outcomes today.

I use Fastmail and have for years, zero complaints. Not sure it's the cheapest though.

Or wait for Google to change their mind for the hundredth time :)

Whoreson Welles
Mar 4, 2015

ON TO THE NEXT PAGE!

leper khan posted:

I use namecheap as a registrar. They're fine. I run some services from my home network. They have a relatively cheap shared hosted offering if that's what you're looking for.

There's a goon affiliate for lithium hosting I think.

Thanks for the input everyone!

b0lt
Apr 29, 2005

leper khan posted:

Speaking of domains. Googles free webmail for grandfathered domains is ending in the very near term. What's the best cheapest option for hosting mail. If this was 1999 I'd manage it myself. But I'd like to both receive email and have others receive my email. And hosting it myself won't get me those outcomes today.

They backed out of this: https://arstechnica.com/gadgets/2022/05/google-backtracks-on-legacy-gsuite-account-shutdown-wont-take-user-emails/

leper khan
Dec 28, 2010
Honest to god thinks Half Life 2 is a bad game. But at least he likes Monster Hunter.

Awesome. Gives me some time to migrate off more gracefully.

Computer viking
May 30, 2011
Now with less breakage.

If you would like a EU based provider for whatever reason (e.g. the benefits of GDPR compliance), I use mailbox.de . The only drawbacks I can think of is that sometimes a little bit of German leaks through, and it can get a bit expensive if you want to host multiple independent users on your domain.

Gothmog1065
May 14, 2009
Another goofy question as I'm still learing bits about awk and print and stuff.

I'm trying to parse a directory to get a list of sub directories that have a specific file in them. Trying to get that list separated into two variables for processing.

/dir/foo_bar/siteInfo

I'm basically wanting foo and bar split (but there will be multiples) into two variables. The easiest way is probably what a coworker told me to do:

code:
while read -r -d $'\0' line; do
  baz=${line##/*}
  fooList="${fooList} ${baz#*_}
  barList="${barList} ${baz%_*}
done < <$(find /dir -maxdepth 2 -type f -name "siteInfo" -printf '%h\0' 2> dev/null)
But I wasn't sure if I coudl get around it with less work with awk. Secondly, what the hell does %h\0 do? As far as I can tell, it has something to do with printf expecting a length of a short integer, but is that 0? How does that translate to the $'\0' above?

lifg
Dec 4, 2000
<this tag left blank>
Muldoon

Pollyanna posted:

Is it worth it to break up a package with an annoying-to-set-up-tests-for public API for the sole purpose of making testing easier?

We have a self-contained collection of code (call it a module, class, package, applies the same either way). We’re branching some logic deep in the call stack to one of our public methods, where the change itself is rather simple and just passes down some data from further up the stack to make a decision later.

The problem is that we need to input a large, complex state to the public method’s tests for each case we want to test, and this state will have a lot of data that’s orthogonal to the change we’re making but is still necessary to get any output at all. We’ve been able to do this a couple times, but now I gotta build a big ol’ graph for this change and I’m getting loving sick of it.

I’ve proposed unit testing the private functions directly, but that’s discouraged - we follow the best practice of “test code using real APIs - don’t define back doors only for tests”, and unit testing private functions violates that. An alternative I’ve explored is making it simple and easy to set up the large state we want to test against, but that only resolves so much pain.

So, now I’m considering a different solution - just make packages around the logic I want to test. But I have no idea if that actually solves the problem. Would it be worth it to split the package up into smaller ones, and retain a larger “composition” package that uses public APIs of the smaller packages? Or does that violate the spirit of the best practice?

I thought only testing the public methods was more of a recommendation than a strict requirement of TDD?

But I’ve done “An alternative I’ve explored is making it simple and easy to set up the large state we want to test against,” in the past. It was okay initially, but the setup kept getting more complicated and developed its own set of problems. YMMV.

Gothmog1065
May 14, 2009
FYI, the above and this is in shell (specifically bash). But here's an easier (i hope) one:

Is there not a way to use find to find a partial directory when I know the beginning? Using the above example:

/dir/foo_bar/siteInfo

$ cd /dir
$ find . -path "foo*/siteInfo"

does not work, but if I'm searching for the beginning it's fine:

$find . -path "*bar/siteInfo"
/dir/foo_bar/siteInfo

Is there a way to accomplish the first action?

necrotic
Aug 2, 2005
I owe my brother big time for this!
You need either the full prefix or another * for the first path part I think?

ExcessBLarg!
Sep 1, 2001

Gothmog1065 posted:

But I wasn't sure if I coudl get around it with less work with awk.
Probably not awk, but again ruby makes this pretty trivial:
code:
ruby -e 'fooList, barList = Dir["**/siteInfo"].map {|f| f.split("/")[-2].split("_")}.transpose; p fooList; p barList'
or whatever you want to do with the two lists. Alternatively, if you wanted to write this portably for systems that use something other than "/" as the directory separator this is arguably a more correct version:
code:
ruby -e 'fooList, barList = Dir[File.join("**", "siteInfo")].map {|f| File.basename(File.dirname(f)).split("_")}.transpose; p fooList; p barList'

Gothmog1065 posted:

Secondly, what the hell does %h\0 do? As far as I can tell, it has something to do with printf expecting a length of a short integer, but is that 0? How does that translate to the $'\0' above?
If you look at man find, the "%h" directive under the -printf option indicates to print the dirname (leading directories of the file name). The "\0" is a null terminator. So basically for every matching path you're asking find to print the leading directories separated with a null terminator. Meanwhile the "-d $'\0'" says to split each "line" on a null terminator. This is a common idiom to iterating over a list of find results while being compatible with paths that may contain a new line in them, since the default behavior of find and read is to use a newline as the record separator. That said, this "trick" breaks down as soon as you append your parsed results to a whitespace separated result list.

If you wanted to use the shell version (as opposed to ruby or something else) I think the way it's written is, fine, but I'd probably use bash arrays for fooList and barList to avoid the compatibility issue with paths that contain whitespace. Or maybe that's not relevant to you and really doesn't matter.

ExcessBLarg!
Sep 1, 2001

Gothmog1065 posted:

Is there a way to accomplish the first action?
Unless I'm missing something you should just be able to do:
code:
for f in dir/foo*/siteInfo; do echo "$f"; done
What are you trying to gain by using find? Do you have more directories (10,000+ or something) than a shell glob will support?

Gothmog1065
May 14, 2009

ExcessBLarg! posted:

Unless I'm missing something you should just be able to do:
code:
for f in dir/foo*/siteInfo; do echo "$f"; done
What are you trying to gain by using find? Do you have more directories (10,000+ or something) than a shell glob will support?

In this instance, nothing in particular in terms of speed. However, the script is going to have various options that will be executed in various ways:
/dir/dontCareDir/ (no siteInit)
/dir/sd_foo/siteInfo
/dir/sd_bar/siteInfo
/dir/sd_baz/siteInfo
/dir/sft_foo/siteInfo
/dir/sft_bar/siteInfo

1 - There will be an 'all' function that will search every directory for the siteInfo
2 - There will be a 'group' function that will work on a specific group (sd or sft)
3 - there will be a 'single' function that will work on a single directory.

3 will be handled without the loop (of course), but I was looking to see if there was a way to adapt the find command above to incorporate #2. So going by what you said, it'll be more along the lines of:

code:
selection=sd
[[ $all = "YES" ]] && searchPath="*" || searchPath="${selection}_*"
for f in dir/$"searchPath"/siteInfo; do echo "$f"; done
Where selection will be determined in other ways.


ExcessBLarg! posted:

Probably not awk, but again ruby makes this pretty trivial:
code:
ruby -e 'fooList, barList = Dir["**/siteInfo"].map {|f| f.split("/")[-2].split("_")}.transpose; p fooList; p barList'
or whatever you want to do with the two lists. Alternatively, if you wanted to write this portably for systems that use something other than "/" as the directory separator this is arguably a more correct version:
code:
ruby -e 'fooList, barList = Dir[File.join("**", "siteInfo")].map {|f| File.basename(File.dirname(f)).split("_")}.transpose; p fooList; p barList'
If you look at man find, the "%h" directive under the -printf option indicates to print the dirname (leading directories of the file name). The "\0" is a null terminator. So basically for every matching path you're asking find to print the leading directories separated with a null terminator. Meanwhile the "-d $'\0'" says to split each "line" on a null terminator. This is a common idiom to iterating over a list of find results while being compatible with paths that may contain a new line in them, since the default behavior of find and read is to use a newline as the record separator. That said, this "trick" breaks down as soon as you append your parsed results to a whitespace separated result list.

If you wanted to use the shell version (as opposed to ruby or something else) I think the way it's written is, fine, but I'd probably use bash arrays for fooList and barList to avoid the compatibility issue with paths that contain whitespace. Or maybe that's not relevant to you and really doesn't matter.
Sadly, no ruby being used. I believe my confusion with %h was coming from the fact I was looking under man printf and not man find, which makes a world of difference.

I should use arrays regardless since there will be duplicates and one of the functions that I'll be doing will be sort -u so those are no longer there. Using the arrays will be easy within for f in dir<blah>, and easier to search.

ExcessBLarg!
Sep 1, 2001
Honestly the scope of what you're describing sounds more appropriate as a Python (or, if it were me, Ruby) script. I'd think you'd want a language that natively supports arrays, sets, sorting, provides straight-forward tools for text parsing, and doesn't stymie you with quoting issues around every corner.

ultrafilter
Aug 23, 2007

It's okay if you have any questions.


ExcessBLarg! posted:

Honestly the scope of what you're describing sounds more appropriate as a Python (or, if it were me, Ruby) script. I'd think you'd want a language that natively supports arrays, sets, sorting, provides straight-forward tools for text parsing, and doesn't stymie you with quoting issues around every corner.

Agreed 100%. If you have to use shell scripts it's nice to know how to do things but nowadays there are usually better options.

ArcticZombie
Sep 15, 2010
I'm not clear on exactly what you're trying to do. Given this directory structure:

code:
/dir/dontCareDir/ (no siteInit)
/dir/sd_foo/siteInfo
/dir/sd_bar/siteInfo
/dir/sd_baz/siteInfo
/dir/sft_foo/siteInfo
/dir/sft_bar/siteInfo
What's the expected output for the "all" case?

ArcticZombie fucked around with this message at 16:12 on May 28, 2022

Pollyanna
Mar 5, 2005

Milk's on them.


lifg posted:

I thought only testing the public methods was more of a recommendation than a strict requirement of TDD?

But I’ve done “An alternative I’ve explored is making it simple and easy to set up the large state we want to test against,” in the past. It was okay initially, but the setup kept getting more complicated and developed its own set of problems. YMMV.

Yeah, it’s a recommendation, but the rest of the team supports it, and I’m more interested in working well alongside them than arguing dogma.

This is all temporary, anyway - this codebase is going to be replaced with one that doesn’t need the large set of input data (thank god).

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE
Is PROLOG still the defacto standard for knowledge-basing? And how does one go about interacting with a PROLOG interpreter in this era of containerized everything? Or would you embed a knowledge base into another application somehow?

The thing that got me thinking down this road was ffmpeg and how some old/obscure files can be very difficult to process because of container/codec/colorspace issues. The solution that occurred to me was basically Prolog-style route finding… encode the compatibilities of each filter into a knowledge base and pass the request as an (input,output) tuple and see if it can be satisfied.

That’s not a problem that I’m actively going to spend time solving but it’s a pattern that I actually see a lot where you just wanna see if there’s a “route” that exists, and if there’s more than one what the min-cost is.

I suppose the other way you could do it would be graph searching… each characteristic (color depth, etc) becomes a dimension and you just do a multidimensional TSP that finds the min length to each node (and your destination).

Paul MaudDib fucked around with this message at 00:21 on Jun 2, 2022

ultrafilter
Aug 23, 2007

It's okay if you have any questions.


That's a constraint satisfaction problem and there are a ton of tools for working with them out there.

Sab669
Sep 24, 2009

Here's a really dumb question. What is the name for the CSS syntax where you're defining a style that gets added only to elements with multiple classes?

Like:
code:
.nav-link{
//some styles I want applied to both navs on my application
}

.nav-link.vertical{
//styles that only get applied to my nested navs
}
I thought this was a "combinator" but apparently that's something different

Macichne Leainig
Jul 26, 2012

by VG
I think what you're looking for is a CSS selector. You can just select objects with multiple classes with that syntax.

12 rats tied together
Sep 7, 2006

The dot is a selector, specifically, the class selector

Sab669
Sep 24, 2009

Right-o, thanks. I'm really not a designer if you couldn't tell :v: Just wanted to do some reading on it though

Sirocco
Jan 27, 2009

HEY DIARY! HA HA HA!
I'm rendering sprites to two different textures in OpenGL, one's for the main game, and the other is for a cutscene. The point of this is so that I can run a shader program to have a transition effect where the cutscene moves onto the screen and pushes the gameplay screen off. For testing purposes, I've rendered a single small sprite to the centre of the second texture. But when I try to sample from that texture the sprite's not small but fills up the whole of it and I can't quite seem to figure out why. The transition works fine, but instead of a black screen coming down with a small sprite in the centre, I just get an enormous sprite filling up the screen.

Any ideas where I'm going wrong here?

RPATDO_LAMD
Mar 22, 2013

🐘🪠🍆

Sirocco posted:

I'm rendering sprites to two different textures in OpenGL, one's for the main game, and the other is for a cutscene. The point of this is so that I can run a shader program to have a transition effect where the cutscene moves onto the screen and pushes the gameplay screen off. For testing purposes, I've rendered a single small sprite to the centre of the second texture. But when I try to sample from that texture the sprite's not small but fills up the whole of it and I can't quite seem to figure out why. The transition works fine, but instead of a black screen coming down with a small sprite in the centre, I just get an enormous sprite filling up the screen.

Any ideas where I'm going wrong here?

You're pretty light on details here, and the exact issue is probably somewhere in the details.
Render-to-texture should work out just the same as rendering to a framebuffer. Does that single small sprite show up as the correct size etc if you render it to the main framebuffer instead of the texture?

Hughmoris
Apr 21, 2007
Let's go to the abyss!
Need some rookie advice on being better at github and portfolios.

I'm walking thru some AWS tutorials and projects, and I'd like to put them on github. If I'm doing 10 small data engineering project, do I create one repo and put the 10 projects in there? Or should I create 10 different repos, one for each project? I don't know what best practice is on how to structure all this.

*Each project will mostly just be an IaC file and some screenshots + small writeup.

leper khan
Dec 28, 2010
Honest to god thinks Half Life 2 is a bad game. But at least he likes Monster Hunter.

Hughmoris posted:

Need some rookie advice on being better at github and portfolios.

I'm walking thru some AWS tutorials and projects, and I'd like to put them on github. If I'm doing 10 small data engineering project, do I create one repo and put the 10 projects in there? Or should I create 10 different repos, one for each project? I don't know what best practice is on how to structure all this.

*Each project will mostly just be an IaC file and some screenshots + small writeup.

In general no one will look at that.

CarForumPoster
Jun 26, 2013

⚡POWER⚡
Is it okay to ask here about how to structure data?

Use Case
I search through government data using ~10 attributes e.g. company names, addresses, phone numbers, emails, identifier like CAGE codes, etc. Currently I have ~50 tables in BigQuery with public record government company and payments data. (e.g. one table is PPP loan data). A lot of the searching I do is regexes or partial matches, in the future I'd like to do some fuzzy logic or trigrams type searches. Sometimes companies will have multiples of these attribute, e.g. a company might have a registered address, mailing address, and 5 officers at different addresses, all on one row presently.

Several of these tables have 200+ columns with data about a transaction, and I want to keep that data but the only data I need to be searchable is with my known list attributes. I have the columns mapped to what type of data is in them for all of the tables and I basically loop over the relevant ones doing searches presently. This is slow to use but was fast to build and is 1000x faster and better than the alternatives I had before. I often only search a subset of the sources.

Now I'd like to automate much of this searching.

Proposed Solution
I was thinking of making one big rear end BQ table, with the columns I care about (and a few with info about the source table, dates, etc) as individual columns the rest of the data I dont care about in an "other" column. My thought is to structure all of the columns as JSON or strings where appropriate. I'm hoping it could be possible to retain the source's column names in the json, eg if addresses are in mailing_address and registered_agent_address in the source table, have both of those as key names in an "address" column that I can search the values of for matches.

So in one row something like: (transposed for readability)
code:
id | source | company_name | phone | address1 | address2 | city | .... | other
"asdf-123" | 
"ppp_loans.csv" | 
{"company_name": "Example Inc.", "dba_name": "Example Co"} | 
{"office": "5550122", "fax": "5550123"} | 
{"mailing_abcxyz": "123 Fake St","othertype":"456 Good Pl"}|
{}|
{"city1":"Los Angeles", "city2": "Los Angeles"}|
{"amount":100, "rando_col_name": "asdfkaslhfsa"}|
...and so on.

All of the sources are flat tables.


Is there some obvious downside to this I am not aware of? It looks like BigQuery supports structuring data this way, but I am not sure how I will be able to do my regex searches on all the "address" values if I don't already know they key names.

CarForumPoster fucked around with this message at 15:40 on Jun 6, 2022

KillHour
Oct 28, 2007


CarForumPoster posted:

Is it okay to ask here about how to structure data?

Use Case
I search through government data using ~10 attributes e.g. company names, addresses, phone numbers, emails, identifier like CAGE codes, etc. Currently I have ~50 tables in BigQuery with public record government company and payments data. (e.g. one table is PPP loan data). A lot of the searching I do is regexes or partial matches, in the future I'd like to do some fuzzy logic or trigrams type searches. Sometimes companies will have multiples of these attribute, e.g. a company might have a registered address, mailing address, and 5 officers at different addresses, all on one row presently.

Several of these tables have 200+ columns with data about a transaction, and I want to keep that data but the only data I need to be searchable is with my known list attributes. I have the columns mapped to what type of data is in them for all of the tables and I basically loop over the relevant ones doing searches presently. This is slow to use but was fast to build and is 1000x faster and better than the alternatives I had before. I often only search a subset of the sources.

Now I'd like to automate much of this searching.

Proposed Solution
I was thinking of making one big rear end BQ table, with the columns I care about (and a few with info about the source table, dates, etc) as individual columns the rest of the data I dont care about in an "other" column. My thought is to structure all of the columns as JSON or strings where appropriate. I'm hoping it could be possible to retain the source's column names in the json, eg if addresses are in mailing_address and registered_agent_address in the source table, have both of those as key names in an "address" column that I can search the values of for matches.

So in one row something like: (transposed for readability)
code:
id | source | company_name | phone | address1 | address2 | city | .... | other
"asdf-123" | 
"ppp_loans.csv" | 
{"company_name": "Example Inc.", "dba_name": "Example Co"} | 
{"office": "5550122", "fax": "5550123"} | 
{"mailing_abcxyz": "123 Fake St","othertype":"456 Good Pl"}|
{}|
{"city1":"Los Angeles", "city2": "Los Angeles"}|
{"amount":100, "rando_col_name": "asdfkaslhfsa"}|
...and so on.

All of the sources are flat tables.


Is there some obvious downside to this I am not aware of? It looks like BigQuery supports structuring data this way, but I am not sure how I will be able to do my regex searches on all the "address" values if I don't already know they key names.

It looks to me like you're trying to stick multiple dimensions in a single column. Why not just use a document database that can handle JSON natively if you want to store data that way?

Adbot
ADBOT LOVES YOU

CarForumPoster
Jun 26, 2013

⚡POWER⚡

KillHour posted:

It looks to me like you're trying to stick multiple dimensions in a single column. Why not just use a document database that can handle JSON natively if you want to store data that way?

Yep that's definitely the case. Answer is that I don't know how to get the insights I want from that structure of data.

For example say I have a source with one PPP loan per row and the EIN of the company and another source with EIN number of the company and the number of employees it has. My goal might be to figure out the PPP loan dollars/employee by summing the total of the loans by company then divide that by the other table's number of employees. BigQuery's SQL editor makes this an approachable problem because theres a TON of documentation for a self taught Python doesnt-know-much about DBs person like myself. Except in my case theres rarely a nice overlapping shared key like EIN and I am often using regexs to account for times where a period or dash is missing from one db but present in another.

Not sure how I'd go about doing that kind of thing with a document DB.

EDIT: Watching this now and it looks like this is pretty possible, though my data layout might need to change.

https://www.youtube.com/watch?v=STo98QUKDS8

I might just need to build a test database. I havent gotten to this part of the vid yet but it looks like they're going to have a way to have like address.abc_mailing_address_4 and while in a table view it'd look like a bunch of nulls it isnt. Hopefully they'll also say I can query it using my already existing queries but use address.* to search every address, regardless of the key name.

CarForumPoster fucked around with this message at 17:45 on Jun 6, 2022

  • 1
  • 2
  • 3
  • 4
  • 5
  • Post
  • Reply