Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Hughmoris posted:

Had the interview. A couple of thoughts/questions:
  • Sounds like ETL team and reporting team are the same. I'd be doing a mixture of ETL work and reporting/visualizations
  • They are a Cerner shop, and I'd be working a lot with clinical data due to my clinical background
  • For ETL pipeline they said a lot of stuff. Sounds like they use Hadoop (I think she said that), and a Cerner EDW.
  • Anyone use Vertica? Sounds like that is their flavor of SQL.

Overall, sounded like it could be a cool gig. I'd have to brush up on my tech skills a little more. No on-call which is nice. 100% remote right now, will revisit options down the road as COVID progresses.

Just be careful that you don’t accept a glorified report writer position. Frankly I prefer the ETL side far, far more, and it's definitely the more technical / better paid side.

abelwingnut
Dec 23, 2002


is there a good free db or table for all postal/zip/mail codes in the world?

i'm also not sure if this stuff changes? i'm american and really only know this system. it feels like it never changes? maybe the british do?

Xae
Jan 19, 2005

abelwingnut posted:

is there a good free db or table for all postal/zip/mail codes in the world?

i'm also not sure if this stuff changes? i'm american and really only know this system. it feels like it never changes? maybe the british do?

Addresses are the second biggest clusterfuck in computer science, right after time.


Even if a specific country doesn't change often there are nearly 200 countries. So as a whole they change all the time.

Postal codes in many countries are unique or near-unique identifiers so they'll change constantly.

Ruggan posted:

Not a Postgres expert (I do mostly MS SQL), but maybe do something like this?

code:
DECLARE @maxDt as datetime = (SELECT max(TimestampB) FROM TableB)

SELECT *
  FROM TableA
 WHERE TimeStampA > @maxDt


The query runs through an ETL tool that does not support that. It has to be a single self-contained query.
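If the tool will only pass through a single statement, the same comparison can usually be folded into a scalar subquery, which works in both T-SQL and Postgres. A minimal sketch using the table/column names from the quoted post:

code:
SELECT a.*
  FROM TableA a
 WHERE a.TimeStampA > (SELECT MAX(b.TimestampB) FROM TableB b);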

Combat Pretzel
Jun 23, 2004

No, seriously... what kurds?!
Just a quick check, because I have been writing a bunch of terrible SQL queries by hand, using subpar tools (MSQuery *cough*), against DB2 tables with the expected lovely table and column naming schemes from the 80's, after ages of no SQL at all, while waiting to get proper gear at my new workplace (I'm currently using a terrible thin client with a restricted environment). So my head is spinning and I need a check on whether I'm thinking right (not even sure why I'm doing it on a weekend).

I've been trying to identify which people performed which operations in the factory by matching dates and times against attendances. The query looks loosely like this, and takes its sweet time, relatively, which is probably expected considering it has to loop over the joined table a shitload:

code:
SELECT *
FROM
    operation o
    LEFT OUTER JOIN presence p
    ON
        (o.date * 1000000 + o.time) >= (p.punch_in_date * 1000000 + p.punch_in_time) AND
        (o.date * 1000000 + o.time) <= (p.punch_out_date * 1000000 + p.punch_out_time)
WHERE o.something = blah
(This datetime terribleness is technical debt.)

I presume adding this to the ON statement should help a lot, because it can home in on the rows more easily? Or would DB2 optimize it that way internally regardless?
code:
(o.date = p.punch_in_date OR o.date = p.punch_out_date)
(Considering shifts shouldn't be longer than 11 hours, checking entry and exit dates for graveyard shifts.)

Also, DB2 has a BETWEEN operator; does it optimize queries better than writing the comparison out by hand?

Moonwolf
Jun 29, 2004

Flee from th' terrifyin' evil of "NHS"!


Combat Pretzel posted:

Just a quick check, because I have been writing a bunch of terrible SQL queries by hand, using subpar tools (MSQuery *cough*), against DB2 tables with the expected lovely table and column naming schemes from the 80's, after ages of no SQL at all, while waiting to get proper gear at my new workplace (I'm currently using a terrible thin client with a restricted environment). So my head is spinning and I need a check on whether I'm thinking right (not even sure why I'm doing it on a weekend).

I've been trying to identify which people performed which operations in the factory by matching dates and times against attendances. The query looks loosely like this, and takes its sweet time, relatively, which is probably expected considering it has to loop over the joined table a shitload:

code:
SELECT *
FROM
    operation o
    LEFT OUTER JOIN presence p
    ON
        (o.date * 1000000 + o.time) >= (p.punch_in_date * 1000000 + p.punch_in_time) AND
        (o.date * 1000000 + o.time) <= (p.punch_out_date * 1000000 + p.punch_out_time)
WHERE o.something = blah
(This datetime terribleness is technical debt.)

I presume adding this to the ON statement should help a lot, because it can home in on the rows more easily? Or would DB2 optimize it that way internally regardless?
code:
(o.date = p.punch_in_date OR o.date = p.punch_out_date)
(Considering shifts shouldn't be longer than 11 hours, checking entry and exit dates for graveyard shifts.)

Also, DB2 has a BETWEEN operator; does it optimize queries better than writing the comparison out by hand?

That extra condition will probably help, because it gives the optimizer the potential to pick a better join strategy than endlessly looping. Because you're doing a LEFT join you should put it in the ON clause, since putting it in the WHERE would remove the LEFTness by demanding that an o date matches a punch date.

BETWEEN is normally just syntactic sugar for x >= foo AND x <= bar; it shouldn't make any difference to your results at all.
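Concretely, that means keeping the extra date predicate inside the ON clause alongside the range checks, something like the sketch below (names taken from the post; not tested against DB2):

code:
SELECT *
FROM
    operation o
    LEFT OUTER JOIN presence p
    ON
        (o.date = p.punch_in_date OR o.date = p.punch_out_date) AND
        (o.date * 1000000 + o.time) >= (p.punch_in_date * 1000000 + p.punch_in_time) AND
        (o.date * 1000000 + o.time) <= (p.punch_out_date * 1000000 + p.punch_out_time)
WHERE o.something = blah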

Combat Pretzel
Jun 23, 2004

No, seriously... what kurds?!
Cool, thanks.

Are there any query builders that show column and table descriptions in lists, dropdown/combo boxes, or during auto-completion? My coworker has ShowCase Strategy Query, which does that, but he's been looking for a more up-to-date tool because the app is pretty dated nowadays, and none of the others appear to do any of that. It needs to be able to run on Windows 10.

Once I get my dedicated workstation, I'd also like some newer tooling.

joebuddah
Jan 30, 2005
Thanks to Lodge Norths key store I am setting up a home lab.

I just installed SQL server 2019 developer edition.

I have enabled TCP/IP in all three sections of the SQL Server configuration window.

Windows Firewall is disabled on both the server and the client computers.

The SQL user is in both the server logins and the DB logins. The user has the owner, datareader, and datawriter roles.

Windows logins work, but the SQL user's login doesn't.

SQL Server is installed on Windows Server 2016. The client is Windows 10 Pro.

What have I missed that is blocking SQL user access?

joebuddah fucked around with this message at 17:50 on Jul 5, 2020

Sir Bobert Fishbone
Jan 16, 2006

Beebort

joebuddah posted:

Thanks to Lodge Norths key store I am setting up a home lab.

I just installed SQL server 2019 developer edition.

I have enabled TCP/IP in all three sections of the SQL Server configuration window.

Windows Firewall is disabled on both the server and the client computers.

The SQL user is in both the server logins and the DB logins. The user has the owner, datareader, and datawriter roles.

Windows logins work, but the SQL user's login doesn't.

SQL Server is installed on Windows Server 2016. The client is Windows 10 Pro.

What have I missed that is blocking SQL user access?

Do you have mixed mode authentication enabled?

joebuddah
Jan 30, 2005

Sir Bobert Fishbone posted:

Do you have mixed mode authentication enabled?

Thanks, that was it. I forgot that you had to change that setting. That was driving me crazy.
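For anyone else who hits this: the setting lives under Server Properties > Security in SSMS, and the instance has to be restarted after changing it. A T-SQL sketch for checking and flipping it; the registry write below is the commonly documented shortcut, but prefer the SSMS dialog if unsure:

code:
-- 1 = Windows authentication only, 0 = mixed mode
SELECT SERVERPROPERTY('IsIntegratedSecurityOnly');

-- Switch to mixed mode (LoginMode = 2), then restart the SQL Server service
EXEC xp_instance_regwrite
    N'HKEY_LOCAL_MACHINE',
    N'Software\Microsoft\MSSQLServer\MSSQLServer',
    N'LoginMode',
    REG_DWORD,
    2;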

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003

abelwingnut posted:

is there a good free db or table for all postal/zip/mail codes in the world?

i'm also not sure if this stuff changes? i'm american and really only know this system. it feels like it never changes? maybe the british do?

https://www.geonames.org/

Pollyanna
Mar 5, 2005

Milk's on them.


Got a data modeling question to bounce off you guys.

We have Projects, each of which can optionally have one Budget representing their daily limit of money to spend, and each Budget can have many Budget Events, representing a historical store of spending events for that given budget. Each Project also has many State Projects, one for each US State. A State Project can also optionally have a Budget, with associated Budget Events. The State Project's Budget and Budget Events take priority over the Project's Budget and Budget Events for spending operations.

We want to check how much we've spent on a State Project based on:

- the State Project's Budget and Budget Events, or if it is not present,
- the Project's Budget and Budget Events

My reasoning is that there are two ways to do this:

1. You can make two tables per budget type, i.e. `project_budgets` and `project_budget_events`, and `state_project_budgets` and `state_project_budget_events`, or
2. You can make one table for all budgets, i.e. `budgets` and `budget_events`, and relate them to Projects and State Projects via polymorphic associations a la `budgets.budgetable_type['Project'|'State Project']`.

Both of them have advantages and disadvantages, and I'm not even sure either one is an ideal way to solve this problem anyway. What's a reasonable data model for answering the question "can we spend $X.XX on State Project Y" if we need to use the Project Budget when a State Project doesn't have a budget? i.e., fallback behavior?

Trabant
Nov 26, 2011

All systems nominal.
Hello thread. I have an issue that's pretty minor but thoroughly annoying when it happens. I'm sure it's more of a MS SQL Server 17 question than anything to do with the language itself, but thought I'd ask anyway:

Why is it that when I save the output (as .csv or .txt) of my queries, it sometimes saves with column headers and sometimes doesn't? Seems to be no rhyme or reason to which one will be the case.

duffmensch
Feb 20, 2004

Duffman is thrusting in the direction of the problem!

Trabant posted:

Hello thread. I have an issue that's pretty minor but thoroughly annoying when it happens. I'm sure it's more of a MS SQL Server 17 question than anything to do with the language itself, but thought I'd ask anyway:

Why is it that when I save the output (as .csv or .txt) of my queries, it sometimes saves with column headers and sometimes doesn't? Seems to be no rhyme or reason to which one will be the case.

I’m assuming you’re using SSMS - it’s always been super random whether it decides to include headers or not, so I just plan for it either way when I’m copying and pasting.

If I need something specifically without a header row consistently and run the job repeatedly, I run it through SSIS/Visual Studio as a package and specifically configure it to skip the header row.

abelwingnut
Dec 23, 2002



blammo, danke.

Trabant
Nov 26, 2011

All systems nominal.

duffmensch posted:

I’m assuming you’re using SSMS - it’s always been super random whether it decides to include headers or not, so I just plan for it either way when I’m copying and pasting.

If I need something specifically without a header row consistently and run the job repeatedly, I run it through SSIS/Visual Studio as a package and specifically configure it to skip the header row.

Sorry -- good assumption, I am indeed using SSMS. It's kinda good to hear that I'm not crazy, but it's also :psyduck: that it's something that happens to others. Just... how?

--------------------------------------

Since I barged into the thread with silly questions, here's another one that I thought was router-related, but thought I'd run it by y'all too. Short version: as I'm working from home I noticed that "big" SQL queries (say 50k rows or more) seem to cause my router to become unresponsive. Quoting myself from the Networking thread:

Trabant posted:

Popping in for some help, again, please, because I have a weird thing happening. Here's my home network:

code:

	        ISP 
          (gigabit fiber)
                 |
                 |
            wired router 
           (Ubiquiti ER-X)
              /         \
             /           \
        desktop       wireless router 
                      (ASUS RT-N56U)
                            |
                            |
                         laptop

It seems that when I run some SQL scripts from my work laptop, it does ~*something*~ which causes the wired router to become completely unresponsive. The internet for the entire home dies, both wifi and wired, and I can't even reach the ER-X router through my desktop. The only solution is to cycle the power on the ER-X.

This doesn't happen every time nor for every SQL script, but it sure as hell seems to happen when I start queries. It never happens when I'm only using my personal desktop. I'll be doing some testing this weekend to see if I can replicate it with any consistency.

But so that I know if I'm completely insane: is this even possible? Could the router be having some kind of... I don't know, allergic loving reaction to the SQL code? Is it the spike in network traffic? What would you be looking for if you were trying to test this out?

For clarification, I only run SQL from my laptop.

abelwingnut
Dec 23, 2002


also, quick question.

i have a junction table connecting two other tables. as it stands the columns for the junction table are:

code:
CREATE TABLE junctionAtoB (
	id INT NOT NULL AUTO_INCREMENT,
	idA INT NOT NULL,
	idB INT NOT NULL,
	PRIMARY KEY (id),
	FOREIGN KEY (idA) REFERENCES A(id),
	FOREIGN KEY (idB) REFERENCES B(id)
) ENGINE=INNODB
;
is there a better way of doing this? i always have a row identifier (junctionAtoB.id here), and i always make that the primary key, but i'm seeing that a lot of people do not use those in junction tables. instead they make the primary key (idA, idB). is there an advantage there? should i alter my practice of making the row identifier the primary key and start using multiple columns? in A and B, it makes sense, as those would be the main ways of joining. so i guess in that sense it makes less sense to do so in the junction table, as i'd almost never be searching on it. just thinking aloud.

furthermore, anything else i should alter about this? do i have the foreign keys like i want? i need to add constraints for cascading and all that, but just wanting to get the primary key and base setup first.

thanks.

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


abelwingnut posted:

also, quick question.

i have a junction table connecting two other tables. as it stands the columns for the junction table are:

code:
CREATE TABLE junctionAtoB (
	id INT NOT NULL AUTO_INCREMENT,
	idA INT NOT NULL,
	idB INT NOT NULL,
	PRIMARY KEY (id),
	FOREIGN KEY (idA) REFERENCES A(id),
	FOREIGN KEY (idB) REFERENCES B(id)
) ENGINE=INNODB
;
is there a better way of doing this? i always have a row identifier (junctionAtoB.id here), and i always make that the primary key, but i'm seeing that a lot of people do not use those in junction tables. instead they make the primary key (idA, idB). is there an advantage there? should i alter my practice of making the row identifier the primary key and start using multiple columns? in A and B, it makes sense, as those would be the main ways of joining. so i guess in that sense it makes less sense to do so in the junction table, as i'd almost never be searching on it. just thinking aloud.

furthermore, anything else i should alter about this? do i have the foreign keys like i want? i need to add constraints for cascading and all that, but just wanting to get the primary key and base setup first.

thanks.

You’re talking about a “composite primary key”.

Advantages being they are natively indexed in the way your lookups probably need them to be, no need for additional columns, no need to add another unique constraint.

Disadvantages being there is no sole identifier, and inserts can cause page splits to fit the data.

I personally use composite primary keys for junction tables without additional columns, and a standard auto-incrementing pk for junction tables that contain columns other than the fks. With a flat junction table and a single identity pk you’ll probably want supporting indexes and a uniqueness constraint anyway. The single pk only really comes in handy when you’re doing operations against a single row in the table (e.g. an update of a specific row to set another column’s value).

Look composite primary keys up. There are schools of thought that feel very strongly about either choice.
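For comparison, the composite-key version of the table from the question would look roughly like this (same names as above; since the primary key leads with idA, you would likely still want a secondary index on idB for lookups from the B side):

code:
CREATE TABLE junctionAtoB (
	idA INT NOT NULL,
	idB INT NOT NULL,
	PRIMARY KEY (idA, idB),
	FOREIGN KEY (idA) REFERENCES A(id),
	FOREIGN KEY (idB) REFERENCES B(id),
	INDEX idxB (idB)
) ENGINE=INNODB
;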

ChickenWing
Jul 22, 2010

:v:

Pollyanna posted:

Got a data modeling question to bounce off you guys.

We have Projects, each of which can optionally have one Budget representing their daily limit of money to spend, and each Budget can have many Budget Events, representing a historical store of spending events for that given budget. Each Project also has many State Projects, one for each US State. A State Project can also optionally have a Budget, with associated Budget Events. The State Project's Budget and Budget Events take priority over the Project's Budget and Budget Events for spending operations.

We want to check how much we've spent on a State Project based on:

- the State Project's Budget and Budget Events, or if it is not present,
- the Project's Budget and Budget Events

My reasoning is that there are two ways to do this:

1. You can make two tables per budget type, i.e. `project_budgets` and `project_budget_events`, and `state_project_budgets` and `state_project_budget_events`, or
2. You can make one table for all budgets, i.e. `budgets` and `budget_events`, and relate them to Projects and State Projects via polymorphic associations a la `budgets.budgetable_type['Project'|'State Project']`.

Both of them have advantages and disadvantages, and I'm not even sure either one is an ideal way to solve this problem anyway. What's a reasonable data model for answering the question "can we spend $X.XX on State Project Y" if we need to use the Project Budget when a State Project doesn't have a budget? i.e., fallback behavior?

If budgets/budget events are identically structured for projects and state projects, share the table. In this case, the (state) project can own the relationship, like so:

Project: id, budget_id
State Project: id, project_id, budget_id
Budget: id
Budget Event: id, budget_id

Then, if your State Project budget_id is null, you reference the budget_id from the root Project. Budgets themselves don't know anything about being for a state or root project; they just handle budget stuff.
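A sketch of how that fallback could look in SQL, using hypothetical snake_case table names that follow the layout above (the amount column on budget_event is also an assumption):

code:
-- Resolve which budget applies to a state project
SELECT sp.id AS state_project_id,
       COALESCE(sp.budget_id, p.budget_id) AS effective_budget_id
FROM state_project sp
JOIN project p ON p.id = sp.project_id;

-- Spend so far for one state project
SELECT COALESCE(SUM(be.amount), 0) AS spent
FROM state_project sp
JOIN project p ON p.id = sp.project_id
LEFT JOIN budget_event be
       ON be.budget_id = COALESCE(sp.budget_id, p.budget_id)
WHERE sp.id = 123;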

Vegetable
Oct 22, 2010

Has anyone else come across the practice of using WHERE TRUE in every query?

I think the idea is that this allows every other clause in the WHERE section to begin with AND. The consistency allows each clause to be easily moved around or commented out without needing to switch between WHERE and AND.

Not sure if it affects performance, just thought it was interesting.
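For anyone who hasn't run into it, the pattern looks like this (a generic sketch; the table and predicates are made up):

code:
SELECT *
FROM orders
WHERE TRUE
  AND status = 'open'
--AND region = 'EU'        -- any clause can be commented out or reordered freely
  AND created_at >= '2020-01-01';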

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Vegetable posted:

Has anyone else come across the practice of using WHERE TRUE in every query?

I think the idea is that this allows every other clause in the WHERE section to begin with AND. The consistency allows each clause to be easily moved around or commented out without needing to switch between WHERE and AND.

Not sure if it affects performance, just thought it was interesting.

I’ve seen it, mostly in dynamic SQL with layered-on WHERE clauses. I haven’t seen it much in non-dynamic SQL and don’t really see a reason for it to exist there...

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

Vegetable posted:

Not sure if it affects performance, just thought it was interesting.
EXPLAIN SELECT * FROM questions WHERE TRUE AND type='stupid';

CarForumPoster
Jun 26, 2013

⚡POWER⚡
I have a novice DB choice question. I'd like to choose something that will work for the next 2-3 years.

Let me know if there's a better spot for this.

I'm working on a project that scrapes, or FOIA-requests, data on civil lawsuits and stores the data in a Postgres DB (RDS) with one lawsuit per row. I access this data in several ways, searching/filtering and returning results via a paginated REST API by attorney, type of case, location, court, case number, UUID, date, etc., which are each their own column. There are about 60 columns that contain fairly small amounts of data, plus one "everything else" column which is a JSON string (mostly ~5kb, sometimes >100kb, rarely >1MB) for data that doesn't fit in any other column, which I often return but don't ever search by.

A major data set I'm adding will make this ~7M rows long and add 5,000 rows/day (1.8M rows/year) on average. I also update the rows as things change in the cases, updating ~1,000 rows/day. I serve the content to a web app that gets very little traffic as it's behind a login.

Should I be doing something different at this point? I'm worried that it'll take several seconds to run even well-filtered queries, and that something NoSQL like DynamoDB won't allow me to do the flexible searching that is key to this project.

CarForumPoster fucked around with this message at 16:55 on Jul 16, 2020

Xae
Jan 19, 2005

CarForumPoster posted:

I have a novice DB choice question. I'd like to choose something that will work for the next 2-3 years.

Let me know if there's a better spot for this.

I'm working on a project that scrapes, or FOIA-requests, data on civil lawsuits and stores the data in a Postgres DB (RDS) with one lawsuit per row. I access this data in several ways, searching/filtering and returning results via a paginated REST API by attorney, type of case, location, court, case number, UUID, date, etc., which are each their own column. There are about 60 columns that contain fairly small amounts of data, plus one "everything else" column which is a JSON string (mostly ~5kb, sometimes >100kb, rarely >1MB) for data that doesn't fit in any other column, which I often return but don't ever search by.

A major data set I'm adding will make this ~7M rows long and add 5,000 rows/day (1.8M rows/year) on average. I also update the rows as things change in the cases, updating ~1,000 rows/day. I serve the content to a web app that gets very little traffic as it's behind a login.

Should I be doing something different at this point? I'm worried that it'll take several seconds to run even well-filtered queries, and that something NoSQL like DynamoDB won't allow me to do the flexible searching that is key to this project.

Look at moving the big ol' json string to a different table and having a key to join them. You might see performance gains, you might not.

Can't there be multiple attorneys per case? How are you handling that in one row?

I would also look at creating partitions for the table based on a filing date, or some other date that never changes. Then default to a time limited search. This will split up your indexes by partition and avoid searching the entire history when most of the time people should know roughly when a case was filed.
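A sketch of the partitioning suggestion in Postgres (declarative range partitioning; column names are hypothetical, and PostgreSQL 11+ is assumed so the primary key can live on the partitioned table):

code:
CREATE TABLE lawsuit (
    case_id     bigint NOT NULL,
    filing_date date   NOT NULL,
    court       text,
    details     jsonb,
    PRIMARY KEY (case_id, filing_date)   -- must include the partition key
) PARTITION BY RANGE (filing_date);

CREATE TABLE lawsuit_2020 PARTITION OF lawsuit
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');

-- A default time-limited search then only touches recent partitions:
SELECT * FROM lawsuit
WHERE filing_date >= CURRENT_DATE - INTERVAL '2 years'
  AND court = 'Some Court';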

NihilCredo
Jun 6, 2011

iram omni possibili modo preme:
plus una illa te diffamabit, quam multæ virtutes commendabunt

Are attorneys, courts, and locations encoded in a consistent manner?

If they are, you may want to split them off into reference tables, which can improve performance and will let you offer search suggestions / listboxes in the web UI. It will also make it easier to support 1:N or M:N relationships if it becomes necessary.

If they're not, you might want to play around with PG's built-in fuzzy text matching modules, to provide a better search experience than a basic LIKE %?%.
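A sketch of the pg_trgm route (the extension, the gin_trgm_ops operator class, the % operator, and similarity() are standard contrib pieces; the attorney table/column names are hypothetical):

code:
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX idx_attorney_name_trgm
    ON attorney USING gin (name gin_trgm_ops);

SELECT name, similarity(name, 'Bob Loblaw') AS score
FROM attorney
WHERE name % 'Bob Loblaw'     -- trigram "is similar to"
ORDER BY score DESC
LIMIT 20;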

abelwingnut
Dec 23, 2002


theory question:

let's say i've got two tables, [neighborhood] and [neighborhoodMap]:

code:
CREATE TABLE neighborhood (
	id INT NOT NULL AUTO_INCREMENT,
	idGeoCity INT NOT NULL,
	neighborhood VARCHAR(100) NOT NULL,
	descripNeighborhood VARCHAR(1000),
	PRIMARY KEY (id),
	FOREIGN KEY (idGeoCity) REFERENCES geoCity(id)
) ENGINE=INNODB
;
CREATE TABLE neighborhoodMap (
	id INT NOT NULL AUTO_INCREMENT,
	idNeighborhood INT NOT NULL,
	timeCreated datetime NOT NULL,
	timeLastUpdated datetime NOT NULL,
	boundary blob NOT NULL,
	boundaryFillColor VARCHAR(7) NOT NULL,
	boundaryBorderColor VARCHAR(7) NOT NULL,
	PRIMARY KEY (id),
	FOREIGN KEY (idNeighborhood) REFERENCES neighborhood(id)
) ENGINE=INNODB
;

these two will always be 1-to-1: a row in [neighborhood] will only relate to one row in [neighborhoodMap], and that one row in [neighborhoodMap] would only relate to the one row in [neighborhood].

now, in theory, these could be combined into one table--duh. but as two tables, it does two things, at least in my mental image of the mechanics going on. one, it makes mental sense, in a compartmental way--map stuff in [neighborhoodMap], humanish info about the neighborhood in [neighborhood].

secondly, [neighborhood] info will be called a lot--it will populate dropdowns, filters, other things, whereas the map stuff won't necessarily be needed. i will say map related stuff will probably be needed 75% of the time. and the map stuff is where the bulk of information will be. that boundary blob holds a POLYGON, and that can get very, very big.

so in my head, it makes sense to keep the much lighter information that's requested more in its own table, and the much bulkier stuff in its own. is this wrong?

there are effectively two types of calls to these tables: a call that would just need the neighborhood names and descriptions and maybe some other descriptive stuff, for dropdowns and menus and the like, and a call that would need the map information. the latter calls would require a JOIN between [neighborhood] and [neighborhoodMap].

so no matter which request, you're always going through [neighborhood]. it's just that those calls that don't need map information, never, ever have to load the much bulkier columns. does this really matter?

then again, any call that does require map info will have to use the JOIN. are 1-to-1 JOINs ultimately slower in this kind of scenario?

which way do i go?

e: i guess ultimately my question is, if two tables are 1-to-1, and one of them needs the other to be called, does it ever make sense to separate them into two objects that could be joined when needed?

abelwingnut fucked around with this message at 19:27 on Jul 16, 2020

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Xae posted:

Look at moving the big ol' json string to a different table and having a key to join them. You might see performance gains, you might not.

Can't there be multiple attorneys per case? How are you handling that in one row?

I would also look at creating partitions for the table based on a filing date, or some other date that never changes. Then default to a time limited search. This will split up your indexes by partition and avoid searching the entire history when most of the time people should know roughly when a case was filed.

Yes there can. Right now I have a bad system where I separate them into a "lead attorney" and a list of other attorneys. This works for the cases I started this with, but with the subsequent data it's not a great system. I'm tempted to avoid having a bunch of tables that reference each other though, if only because it makes the performance worse and the data analysis harder.


NihilCredo posted:

Are attorneys, courts, and locations encoded in a consistent manner?

If they are, you may want to split them off into reference tables, which can improve performance and will let you offer search suggestions / listboxes in the web UI. It will also make it easier to support 1:N or M:N relationships if it becomes necessary.

If they're not, you might want to play around with PG's built-in fuzzy text matching modules, to provide a better search experience than a basic LIKE %?%.

It's indeterminate how many attorneys and locations there are, so I'm not sure I could offer predefined values in the form of a reference table. If they aren't NULL, they're always encoded as strings which are human readable and in a pretty consistent format.

I'll def check the fuzzy string matching link out.

EDIT: That fuzzy string matching link is great and def a problem I will have. I do my best to standardize names when possible, but obviously that's...impossible for human names acquired through scraping.

CarForumPoster fucked around with this message at 19:34 on Jul 16, 2020

Xae
Jan 19, 2005

CarForumPoster posted:

Yes there can. Right now I have a bad system where I separate them into a "lead attorney" and a list of other attorneys. This works for the cases I started this with, but with the subsequent data it's not a great system. I'm tempted to avoid having a bunch of tables that reference each other though, if only because it makes the performance worse and the data analysis harder.


It's indeterminate how many attorneys and locations there are, so I'm not sure I could offer predefined values in the form of a reference table. If they aren't NULL, they're always encoded as strings which are human readable and in a pretty consistent format.

I'll def check the fuzzy string matching link out.

EDIT: That fuzzy string matching link is great and def a problem I will have. I do my best to standardize names when possible, but obviously that's...impossible for human names acquired through scraping.

It depends on how you access things, but lookup tables might improve your performance.

Names as strings are big. Ints are small. The associative table between Case and Attorney will scan much faster than the Case table.

TimestampTZ64(FilingDate) + Int32(CaseID) + Int32(AttorneyID) + Int32(AttorneyType) = 160 bits = 20 bytes

Postgres is decent at generating plans, so it should do a scan on Attorney, then an index lookup on the Case/Attorney association table, then a partition/PK lookup on Case.

In that case the only table scan is on the Attorney table, which can't possibly be that big. It'll probably be in memory most of the time. From there it uses indexes to avoid scanning the big table.

It seems dumb and extra effort, but warehouses are the one place where bit sizes and that jazz still matter, because a hundred bytes here or there times a few million or a billion rows adds up fast.

When you're running that fuzzy logic search it's better to run it on an Attorney table with 10,000 rows than the Case table with 10,000,000.
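A sketch of that layout with hypothetical names (the % operator reuses the pg_trgm extension from the earlier sketch, and "lawsuit" stands in for whatever the main case table ends up being):

code:
CREATE TABLE attorney (
    attorney_id int PRIMARY KEY,
    name        text NOT NULL
);

CREATE TABLE case_attorney (
    case_id       bigint NOT NULL,              -- matches the main case table's key
    attorney_id   int    NOT NULL REFERENCES attorney (attorney_id),
    attorney_type int    NOT NULL,              -- e.g. lead vs. co-counsel
    PRIMARY KEY (case_id, attorney_id)
);

-- Fuzzy-match on the small attorney table, then follow indexes into the big one
SELECT c.*
FROM attorney a
JOIN case_attorney ca ON ca.attorney_id = a.attorney_id
JOIN lawsuit c        ON c.case_id = ca.case_id
WHERE a.name % 'Bob Loblaw';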

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Xae posted:

It depends on how you access things, but look ups might improve your performance.

[...]

In that case the only table scan is on the Attorney table, which can't possibly be that big. It'll probably be in memory most of the time. From there it uses indexes to avoid scanning the big table.

It seems dumb and extra effort, but Warehouses are the one place where bit size and that jazz still matters. Because a hundred bytes here or there times a few million or a billion rows adds up fast.

When you're running that fuzzy logic search its better to run it on an Attorney table with 10,000 rows than the Case table with 10,000,000.

I'm worried that was the argument that led to this nearly unusable API layout: https://www.courtlistener.com/api/rest-info/ Take a look at that diagram and try to figure out how to find all of "Bob Loblaw's" cases in

I love courtlistener and talk with the owner every now and then so I should pick his brain on this, but man their API is an absolute BITCH to search flexibly.

An attorney table will likely end up on the order of 1M rows for a 10M row case table. This is because of challenges with attorney attribution between courts, states, etc.

For reference, the count of CL's "attorneys" endpoint is 6.4M for a 17M count "dockets" endpoint.

CarForumPoster fucked around with this message at 01:13 on Jul 17, 2020

Xae
Jan 19, 2005

CarForumPoster posted:

I'm worried that was the argument that led to this nearly unusable API layout: https://www.courtlistener.com/api/rest-info/ Take a look at that diagram and try to figure out how to find all of "Bob Loblaw's" cases in

I love courtlistener and talk with the owner every now and then so I should pick his brain on this, but man their API is an absolute BITCH to search flexibly.

An attorney table will likely end up on the order of 1M rows for a 10M row case table. This is because of challenges with attorney attribution between courts, states, etc.

For reference, the count of CL's "attorneys" endpoint is 6.4M for a 17M count "dockets" endpoint.

Yeah, that is a bit over designed for your purposes.

Build the data model out a few ways. Use a data generator to fill out the tables and then start querying it.

A lot of these questions can only be answered by experimenting.

abelwingnut
Dec 23, 2002


ok, question.

i have a dataset i want to export into an online service. they have a thing where you upload a csv to them. problem is my dataset's 230mb, and their upload system only handles csv 10mb or smaller. is there a way of exporting a table from sql server that divides the resulting csv into particular sizes? like, is there some way i can say take the first 10k rows, put that in csv1, take the next 10k rows, put those in csv2, etc. until all rows have been exported?

e: just thought of a very manual way using offset and fetch. that'll ultimately work, but is there a more automatic, less annoying way?

abelwingnut fucked around with this message at 01:23 on Jul 20, 2020
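For reference, the OFFSET/FETCH route mentioned in the edit looks roughly like this per chunk in T-SQL (a stable ORDER BY is required; the table and key names are made up). You still have to bump the page number and save each result by hand, which is why a small script outside the database, as suggested below, is usually less painful:

code:
DECLARE @page INT = 0, @pageSize INT = 10000;

SELECT *
FROM dbo.ExportTable
ORDER BY id
OFFSET @page * @pageSize ROWS
FETCH NEXT @pageSize ROWS ONLY;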

spiritual bypass
Feb 19, 2008

Grimey Drawer
What you probably want to do is make the big CSV, then use some tool (not the database) to split up that CSV. Better yet, write a little program that reads slightly less than 10MB of data into a buffer, posts it to the service, then repeats until EOF.

abelwingnut
Dec 23, 2002


yea, that makes sense. problem is i have no idea how to do that. :(

would this be something easily done with python?

Volguus
Mar 3, 2009

abelwingnut posted:

yea, that makes sense. problem is i have no idea how to do that. :(

would this be something easily done with python?

Yup, python would be quite suitable for the job.

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE
How do I avoid getting cartesian products in my results?

I have parent_rec, child1, childA record types. A parent may have zero or one child1 type records and zero or one childA records, and I want to pivot these into a single flat record.

by doing "select * from parent_rec left join child1 left join childA" I am getting the cartesian product back out, eg:

  • parent1, child1, null
  • parent1, null, childA
  • parent1, child1, childA

Instead I want just "parent1, child1, childA".

I can't do an inner join since there may exist records of the type "parent1, null, childA" or "parent1, child1, null" that need to be retrieved.

Paul MaudDib fucked around with this message at 19:02 on Jul 20, 2020

Splinter
Jul 4, 2003
Cowabunga!

Paul MaudDib posted:

How do I avoid getting cartesian products in my results?

I have parent_rec, child1, childA record types. A parent may have zero or one child1 type records and zero or one childA records, and I want to pivot these into a single flat record.

by doing "select * from parent_rec left join child1 left join childA" I am getting the cartesian product back out, eg:

  • parent1, child1, null
  • parent1, null, childA
  • parent1, child1, childA

Instead I want just "parent1, child1, childA".

I can't do an inner join since there may exist records of the type "parent1, null, childA" or "parent1, child1, null" that need to be retrieved.

Could you group by parent_rec then use aggregate functions (e.g. MAX()) to get your child data? If there's only at most 1 of each child type, MAX should either return the matching child value, or null if there is no child for that parent.
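A sketch of that aggregate approach (hypothetical column names; as the following posts point out, a plain left join should already return one row if each child really is 0-or-1 per parent, so the GROUP BY mostly guards against accidental duplicates):

code:
SELECT p.id,
       MAX(c1.some_value) AS child1_value,
       MAX(ca.some_value) AS childA_value
FROM parent_rec p
LEFT JOIN child1 c1 ON c1.parent_id = p.id
LEFT JOIN childA ca ON ca.parent_id = p.id
GROUP BY p.id;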

NihilCredo
Jun 6, 2011

iram omni possibili modo preme:
plus una illa te diffamabit, quam multæ virtutes commendabunt

Uh, what's your DB engine? On both MSSQL and Postgres, querying A left join B left join C will absolutely return exactly one record as long as neither B nor C has more than one of A's children.

Are you doing the old-style joins, by any chance? I.e. instead of "from A left join B on A.id=B.id left join C on A.id=C.id", are you doing "from A, B, C where (something)"?

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
What you describe isn't how outer joins work...

http://sqlfiddle.com/#!17/4ac91/3

can you post your query in more detail and tell us what RDBMS you are using?

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

NihilCredo posted:

Uh, what's your DB engine? On both MSSQL and Postgres, querying A left join B left join C will absolutely return exactly one record as long as neither B nor C has more than one of A's children.

Are you doing the old-style joins, by any chance? I.e. instead of "from A left join B on A.id=B.id left join C on A.id=C.id", are you doing "from A, B, C where (something)"?

Oracle 11g and I'm doing the "from A left join B on A.id=B.id left join C on A.id=C.id" syntax.

I'm eliding some complexity here: we've got 4 tables (parent, a many-to-many join table, and two types of records attached via the join table), so there are four joins here: parent to join-table, join-table to child1, parent to join-table, join-table to childA.

I have nuked my test data but I'm pretty sure that's what was going on. The max thing might work though.

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
There is much to hate about Oracle but I'm pretty sure outer joins work the way they are supposed to

Do you have a minimal example of the behaviour

If all your joins go via a many-to-many table then that could be obfuscating things

hey mom its 420
May 12, 2007

yo what's up YOSPOS long time no see.

I have to get some data out of a few weirdly designed tables in postgres. We process tips for hairdressers and whatnot. We have a Ledger table where we track positive and negative additions to someone's balance. There's also a Payout table, and a payout is us sending money out, and a payout covers several tips, for several people. Once someone has been paid out, we add negative balance to their Ledger and mark it with a payoutId to know which payout resulted in that person's balance going down.

So far so good. I'm writing some analytics for that because the db guy is not currently available. I have to get the average number of tips per payout for one person. Now you would imagine that there would be an M2M table or something that tells you which tips were paid out in which payout. But there isn't. Apparently, the way to get all the Ledger entries that were paid out in one payout (say payoutId = 99) is to find the Ledger entry with payoutId = 98 and then just go up by ids until we get to a Ledger entry with payoutId = 99.

so I have
code:
Ledger or whatever
id    reason    amount    payoutId
9     tip       $4
8     payout    -$32      9
7     tip       $4
6     tip       $4
5     tip       $4
4     payout    -$52      8
3     tip       $4
2     payout    $3        7
And I need to get a result that tells me that payout 9 had 3 tips, payout 7 had 1 tip, and so on for each payout. Or even better, annotate each tip with a column that says which payout it was paid out in; then I can do grouping and aggregates. Getting this would be pretty easy if the data were more normalized or if I were doing this in application code, but with SQL this is just hosed.
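One way to express the "walk up the ids" rule directly in Postgres is to attribute each tip to the first payout row with a higher id, then group. An untested sketch, with table/column names following the example above:

code:
WITH annotated AS (
    SELECT t.id,
           t.amount,
           (SELECT p.payoutId
              FROM ledger p
             WHERE p.reason = 'payout'
               AND p.id > t.id
             ORDER BY p.id
             LIMIT 1) AS paid_out_in
    FROM ledger t
    WHERE t.reason = 'tip'
)
SELECT paid_out_in, COUNT(*) AS tips
FROM annotated
WHERE paid_out_in IS NOT NULL     -- tips not yet covered by any payout
GROUP BY paid_out_in;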

  • Reply