|
Hughmoris posted:Had the interview. Couple of thoughts/questions: Just be careful that you don't accept a glorified report writer position. Frankly I prefer the ETL side far, far more, and it's definitely the more technical / better paid side.
|
# ? Jul 4, 2020 03:06 |
|
|
|
is there a good free db or table for all postal/zip/mail codes in the world? i'm also not sure if this stuff changes? i'm american and really only know this system. it feels like it never changes? maybe the british do?
|
# ? Jul 4, 2020 03:07 |
|
abelwingnut posted:is there a good free db or table for all postal/zip/mail codes in the world? Addresses are the second biggest clusterfuck in computer science, right after time. Even if a specific country doesn't change often there are nearly 200 countries. So as a whole they change all the time. Postal codes in many countries are unique or near-unique identifiers so they'll change constantly. Ruggan posted:Not a Postgres expert (I do mostly MS SQL), but maybe do something like this? The query runs through an ETL tool that does not support that. It has to be a single self-contained query.
|
# ? Jul 4, 2020 04:48 |
|
Just a quick check, because I have been writing a bunch of terrible SQL queries by hand, using subpar tools (MSQuery *cough*), against DB2 tables with the expected lovely table and column naming schemes from the 80's, after ages of no SQL at all, while waiting to get proper gear at my new workplace (I'm currently using a terrible thin client with a restricted environment). So my head is spinning and I need a sanity check on whether I'm thinking right (not even sure why I'm doing this on a weekend). I've been trying to identify which people performed which operations in the factory by matching dates and times against attendances. The query looks loosely like this, and takes its sweet time, relatively speaking, which is probably expected considering it has to loop over the joined table a shitload: code:
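A hedged reconstruction of the query shape being described — table and column names are invented for illustration, not the real schema:

```sql
-- Hypothetical: match each operation to whoever was punched in at the time
SELECT o.OPERATION, o.OPDATE, o.OPTIME, a.EMPNO
FROM OPERATIONS o
LEFT JOIN ATTENDANCE a
  ON a.WORKCENTER = o.WORKCENTER
WHERE o.OPDATE >= a.PUNCH_IN
  AND o.OPDATE <= a.PUNCH_OUT
```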
I presume adding this to the ON clause should help a lot, because it can home in on the rows more easily? Or would DB2 optimize it that way internally regardless? code:
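The proposed addition presumably looked something like this (invented names), with the date window folded into the join condition:

```sql
LEFT JOIN ATTENDANCE a
  ON a.WORKCENTER = o.WORKCENTER
 AND o.OPDATE BETWEEN a.PUNCH_IN AND a.PUNCH_OUT
```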
Also, DB2 has a BETWEEN operator, does it optimize queries better than writing it out by hand?
|
# ? Jul 4, 2020 10:57 |
|
Combat Pretzel posted:Just a quick check, because I have been writing a bunch of terrible SQL queries by hand, using subpar tools (MSQuery *cough*), against DB2 tables with expected lovely table and column naming schemes from the 80's, after ages of no SQL at all, while waiting to get proper gear at my new workplace (I'm currently using a terrible thin client with a restricted environment), so my head spinning and need a check whether I'm thinking right (not even sure why I'm doing it on a weekend). That join condition will probably help, because it gives the optimizer the potential to pick a better join type than endlessly looping. And because you're doing a LEFT join, you should put it in the ON clause; otherwise you're removing the LEFTness by demanding that every o-side date matches a punch date. BETWEEN is normally just syntactic sugar for x >= foo AND x <= bar, so it shouldn't make any difference to your results at all.
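To illustrate the LEFTness point with made-up names:

```sql
-- Filter in WHERE: unmatched rows have a.punch_in / a.punch_out NULL,
-- fail the comparison, and get discarded -- the LEFT join degrades to INNER.
SELECT o.op_id, a.emp_no FROM ops o
LEFT JOIN att a ON a.wc = o.wc
WHERE o.opdate BETWEEN a.punch_in AND a.punch_out;

-- Filter in ON: ops rows with no matching attendance survive with NULLs.
SELECT o.op_id, a.emp_no FROM ops o
LEFT JOIN att a ON a.wc = o.wc
  AND o.opdate BETWEEN a.punch_in AND a.punch_out;
```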
|
# ? Jul 4, 2020 11:38 |
|
Cool, thanks. Are there any query builders that show column and table descriptions in lists, dropdown/combo boxes, or during auto-completion? My coworker has ShowCase Strategy Query, which does that, but he's been looking for a more up-to-date tool because the app is pretty dated nowadays, and none of the others appear to do any of that. It needs to run on Windows 10. Once I get my dedicated workstation, I'd also like some newer tooling.
|
# ? Jul 5, 2020 11:36 |
|
Thanks to Lodge North's key store I am setting up a home lab. I just installed SQL Server 2019 Developer Edition. I have enabled TCP/IP in all three sections of the SQL Server Configuration window. Windows Firewall is disabled on both the server and client computers. The SQL user is in both the server logins and the DB logins, and has the db_owner, db_datareader and db_datawriter roles. Windows logins work, but the SQL user doesn't. SQL Server is installed on Windows Server 2016; the client is Windows 10 Pro. What have I missed that is blocking SQL user access? joebuddah fucked around with this message at 17:50 on Jul 5, 2020 |
# ? Jul 5, 2020 17:37 |
|
joebuddah posted:Thanks to Lodge Norths key store I am setting up a home lab. Do you have mixed mode authentication enabled?
|
# ? Jul 5, 2020 17:53 |
|
Sir Bobert Fishbone posted:Do you have mixed mode authentication enabled? Thanks, that was it. I forgot that you had to change that setting. It was driving me crazy.
|
# ? Jul 5, 2020 18:47 |
|
abelwingnut posted:is there a good free db or table for all postal/zip/mail codes in the world? https://www.geonames.org/
|
# ? Jul 6, 2020 21:37 |
|
Got a data modeling question to bounce off you guys. We have Projects, each of which can optionally have one Budget representing their daily limit of money to spend, and each Budget can have many Budget Events, representing a historical store of spending events for that given budget. Each Project also has many State Projects, one for each US State. A State Project can also optionally have a Budget, with associated Budget Events. The State Project's Budget and Budget Events take priority over the Project's Budget and Budget Events for spending operations. We want to check how much we've spent on a State Project based on: - the State Project's Budget and Budget Events, or if it is not present, - the Project's Budget and Budget Events. My reasoning is that there are two ways to do this: 1. You can make two tables per budget type, i.e. `project_budgets` and `project_budget_events`, and `state_project_budgets` and `state_project_budget_events`, or 2. You can make one table for all budgets, i.e. `budgets` and `budget_events`, and relate them to Projects and State Projects via polymorphic associations a la `budgets.budgetable_type['Project'|'State Project']`. Both have advantages and disadvantages, and I'm not even sure either one is an ideal way to solve this problem anyway. What's a reasonable data model for answering the question "can we spend $X.XX on State Project Y" if we need to use the Project Budget when a State Project doesn't have a budget, i.e. fallback behavior?
|
# ? Jul 8, 2020 16:48 |
|
Hello thread. I have an issue that's pretty minor but thoroughly annoying when it happens. I'm sure it's more of a MS SQL Server 17 question than anything to do with the language itself, but thought I'd ask anyway: Why is it that when I save the output (as .csv or .txt) of my queries, it sometimes saves with column headers and sometimes doesn't? Seems to be no rhyme or reason to which one will be the case.
|
# ? Jul 8, 2020 21:36 |
|
Trabant posted:Hello thread. I have an issue that's pretty minor but thoroughly annoying when it happens. I'm sure it's more of a MS SQL Server 17 question than anything to do with the language itself, but thought I'd ask anyway: I’m assuming you’re using SSMS - it’s always been super random whether it decides to or not, so I just plan for it to do so when I’m copying and pasting. If I need something specifically without a header row consistently and run the job repeatedly, I run it through SSIS/Visual Studio as a package and specifically configure it to skip the header row.
|
# ? Jul 9, 2020 02:22 |
|
blammo, danke.
|
# ? Jul 9, 2020 03:30 |
|
duffmensch posted:I'm assuming you're using SSMS - it's always been super random whether it decides to or not, so I just plan for it to do so when I'm copying and pasting. Sorry -- good assumption, I am indeed using SSMS. It's kinda good to hear that I'm not crazy, but it's also a little unnerving that it happens to others. Just... how? -------------------------------------- Since I barged into the thread with silly questions, here's another one that I thought was router-related, but thought I'd run it by y'all too. Short version: as I'm working from home, I noticed that "big" SQL queries (say 50k rows or more) seem to cause my router to become unresponsive. Quoting myself from the Networking thread: Trabant posted:Popping in for some help, again, please, because I have a weird thing happening. Here's my home network: For clarification, I only run SQL from my laptop.
|
# ? Jul 9, 2020 03:42 |
|
also, quick question. i have a junction table connecting two other tables. as it stands the columns for the junction table are: code:
furthermore, anything else i should alter about this? do i have the foreign keys like i want? i need to add constraints for cascading and all that, but just wanting to get the primary key and base setup first. thanks.
|
# ? Jul 9, 2020 04:22 |
|
abelwingnut posted:also, quick question. You’re talking about a “composite primary key”. Advantages being they are natively indexed in the way your lookups probably need them to be, no need for additional columns, no need to add another unique constraint. Disadvantages being there is no sole identifier, inserts can cause page splits to fit the data. I personally use composite primary keys for your junction tables without additional columns and a standard auto incrementing pk for junction tables that contain other columns than the fks. In a flat junction table with a single identity pk you’ll probably want supporting indexes and a uniqueness constraint anyway. The single pk only really comes in handy when you’re doing operations against a single row in the table (eg an update of a specific row to set another column’s value). Look composite primary keys up. There are schools of thought that feel very strongly about either choice.
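A sketch of the composite-key version of a flat junction table (generic SQL, invented names):

```sql
CREATE TABLE book_author (
    idBook   INT NOT NULL,
    idAuthor INT NOT NULL,
    PRIMARY KEY (idBook, idAuthor),           -- composite PK doubles as the lookup index
    FOREIGN KEY (idBook)   REFERENCES book(id),
    FOREIGN KEY (idAuthor) REFERENCES author(id)
);
```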
|
# ? Jul 9, 2020 05:34 |
Pollyanna posted:Got a data modeling question to bounce off you guys. If budgets/budget events are identically structured for projects and state projects, share the table. In this case, the (state) project can own the relationship, like so:
Project: id, budget_id
State Project: id, project_id, budget_id
Budget: id
Budget Event: id, budget_id
Then, if your State Project budget_id is null, you reference the budget_id from the root Project. Budgets themselves don't know anything about being for a state or root project; they just handle budget stuff.
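Under that model, the fallback lookup could be sketched as (column names per the layout above):

```sql
-- Resolve which budget governs a given state project; fall back to the parent
SELECT COALESCE(sp.budget_id, p.budget_id) AS effective_budget_id
FROM state_projects sp
JOIN projects p ON p.id = sp.project_id
WHERE sp.id = ?;
-- then sum budget_events for effective_budget_id and compare to the daily limit
```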
|
|
# ? Jul 9, 2020 18:03 |
|
Has anyone else come across the practice of using WHERE TRUE in every query? I think the idea is that this allows every other clause in the WHERE section to begin with AND. The consistency allows each clause to be easily moved around or commented out without needing to switch between WHERE and AND. Not sure if it affects performance, just thought it was interesting.
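For reference, the pattern looks like this; every predicate starts with AND, so any line can be commented out or reordered without editing its neighbors (T-SQL has no bare boolean literal, which is why you'll see WHERE 1=1 there instead):

```sql
SELECT *
FROM orders               -- invented example table
WHERE TRUE
  AND status = 'open'
  -- AND region = 'EMEA'  -- toggled off without touching the WHERE keyword
  AND created_at >= DATE '2020-01-01';
```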
|
# ? Jul 13, 2020 05:01 |
|
Vegetable posted:Has anyone else come across the practice of using WHERE TRUE in every query? I’ve seen it. Mostly in dynamic sql with layered on where clauses. I haven’t seen it much in non-dynamic sql and don’t really see a reason for it to exist there...
|
# ? Jul 13, 2020 07:43 |
|
Vegetable posted:Not sure if it affects performance, just thought it was interesting.
|
# ? Jul 14, 2020 06:22 |
|
I have a novice DB choice question. I'd like to choose something that will work for the next 2-3 years. Let me know if there's a better spot for this. I'm working on a project that scrapes, or obtains via FOIA requests, data on civil lawsuits and stores the data in a Postgres DB (RDS) with one lawsuit per row. I access this data in several ways, searching/filtering and returning results via a paginated REST API by attorney, type of case, location, court, case number, UUID, date, etc., which are each their own column. There are about 60 columns that contain a fairly small amount of data, and one "everything else" column, which is a JSON string (mostly ~5kB, sometimes >100kB, rarely >1MB) for data that doesn't go in any other column; I often return it but don't ever search by it. A major data set I'm adding will make this ~7M rows long and add 5,000 rows/day (1.8M rows/year) on average. I also update rows as things change in the cases, updating ~1,000 rows/day. I serve the content to a web app that gets very little traffic, as it's behind a login. Should I be doing something different at this point? I'm worried that it'll take several seconds to run even well-filtered queries, and that something NoSQL like DynamoDB won't allow the flexible searching that is key to this project. CarForumPoster fucked around with this message at 16:55 on Jul 16, 2020 |
# ? Jul 16, 2020 16:49 |
|
CarForumPoster posted:I have a novice DB choice question. I'd like to choose something that will work for the next 2-3 years. Look at moving the big ol' json string to a different table and having a key to join them. You might see performance gains, you might not. Can't there be multiple attorneys per case? How are you handling that in one row? I would also look at creating partitions for the table based on a filing date, or some other date that never changes. Then default to a time limited search. This will split up your indexes by partition and avoid searching the entire history when most of the time people should know roughly when a case was filed.
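A sketch of the date-based partitioning suggestion as Postgres declarative partitioning (names invented, not the poster's actual schema):

```sql
CREATE TABLE cases (
    id       bigint GENERATED ALWAYS AS IDENTITY,
    filed_on date NOT NULL,
    court    text,
    extra    jsonb,
    PRIMARY KEY (id, filed_on)      -- the partition key must be part of the PK
) PARTITION BY RANGE (filed_on);

CREATE TABLE cases_2020 PARTITION OF cases
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');
```

Queries that filter on filed_on then only touch the relevant partitions' indexes.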
|
# ? Jul 16, 2020 17:15 |
|
Are attorneys, courts, and locations encoded in a consistent manner? If they are, you may want to split them off into reference tables, which can improve performance and will let you offer search suggestions / listboxes in the web UI. It will also make it easier to support 1:N or M:N relationships if it becomes necessary. If they're not, you might want to play around with PG's built-in fuzzy text matching modules, to provide a better search experience than a basic LIKE %?%.
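A sketch of the fuzzy-matching idea using Postgres's pg_trgm module (table and column names invented):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX attorneys_name_trgm ON attorneys USING gin (name gin_trgm_ops);

-- similarity() ranges 0..1; the % operator applies the pg_trgm threshold
SELECT name, similarity(name, 'Jon Smith') AS score
FROM attorneys
WHERE name % 'Jon Smith'
ORDER BY score DESC;
```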
|
# ? Jul 16, 2020 18:38 |
|
theory question: let's say i've got two tables, [neighborhood] and [neighborhoodMap]:
CREATE TABLE neighborhood (
  id INT NOT NULL AUTO_INCREMENT,
  idGeoCity INT NOT NULL,
  neighborhood VARCHAR(100) NOT NULL,
  descripNeighborhood VARCHAR(1000),
  PRIMARY KEY (id),
  FOREIGN KEY (idGeoCity) REFERENCES geoCity(id)
) ENGINE=INNODB;
CREATE TABLE neighborhoodMap (
  id INT NOT NULL AUTO_INCREMENT,
  idNeighborhood INT NOT NULL,
  timeCreated datetime NOT NULL,
  timeLastUpdated datetime NOT NULL,
  boundary blob NOT NULL,
  boundaryFillColor VARCHAR(7) NOT NULL,
  boundaryBorderColor VARCHAR(7) NOT NULL,
  PRIMARY KEY (id),
  FOREIGN KEY (idNeighborhood) REFERENCES neighborhood(id)
) ENGINE=INNODB;
these two will always be 1-to-1: a row in [neighborhood] will only relate to one row in [neighborhoodMap], and that one row in [neighborhoodMap] would only relate to the one row in [neighborhood]. now, in theory, these could be combined into one table--duh. but as two tables, it does two things, at least in my mental image of the mechanics going on. one, it makes mental sense, in a compartmental way--map stuff in [neighborhoodMap], humanish info about the neighborhood in [neighborhood]. secondly, [neighborhood] info will be called a lot--it will populate dropdowns, filters, other things, whereas the map stuff won't necessarily be needed. i will say map related stuff will probably be needed 75% of the time. and the map stuff is where the bulk of information will be. that boundary blob holds a POLYGON, and that can get very, very big. so in my head, it makes sense to keep the much lighter information that's requested more in its own table, and the much bulkier stuff in its own. is this wrong? there are effectively two types of calls to these tables: a call that would just need the neighborhood names and descriptions and maybe some other descriptive stuff, for dropdowns and menus and the like, and a call that would need the map information.
the latter calls would require a JOIN between [neighborhood] and [neighborhoodMap]. so no matter which request, you're always going through [neighborhood]. it's just that those calls that don't need map information, never, ever have to load the much bulkier columns. does this really matter? then again, any call that does require map info will have to use the JOIN. are 1-to-1 JOINs ultimately slower in this kind of scenario? which way do i go? e: i guess ultimately my question is, if two tables are 1-to-1, and one of them needs the other to be called, does it ever make sense to separate them into two objects that could be joined when needed? abelwingnut fucked around with this message at 19:27 on Jul 16, 2020 |
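For what it's worth, the two call shapes described would look roughly like this:

```sql
-- Light call: dropdowns and menus, never touches the blob
SELECT id, neighborhood, descripNeighborhood
FROM neighborhood
WHERE idGeoCity = ?;

-- Heavy call: the 1-to-1 join pulls in the polygon only when needed
SELECT n.id, n.neighborhood, m.boundary, m.boundaryFillColor
FROM neighborhood n
JOIN neighborhoodMap m ON m.idNeighborhood = n.id
WHERE n.idGeoCity = ?;
```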
# ? Jul 16, 2020 19:08 |
|
Xae posted:Look at moving the big ol' json string to a different table and having a key to join them. You might see performance gains, you might not. Yes there can. Right now I have a bad system where I separate them into a "lead attorney" and a list of other attorneys. This works for the cases I started this with, but with the subsequent data it's not a great system. I'm tempted to avoid having a bunch of tables that reference each other, though, if only because it makes the performance worse and the data analysis harder. NihilCredo posted:Are attorneys, courts, and locations encoded in a consistent manner? It's indeterminate how many attorneys and locations there are, so I'm not sure I could offer predefined values in the form of a reference table. If they aren't NULL, they're always encoded as strings which are human readable and in a pretty consistent format. I'll def check the fuzzy string matching link out. EDIT: That fuzzy string matching link is great and def a problem I will have. I do my best to standardize names when possible, but obviously that's... impossible for human names acquired through scraping. CarForumPoster fucked around with this message at 19:34 on Jul 16, 2020 |
# ? Jul 16, 2020 19:31 |
|
CarForumPoster posted:Yes there can. Right now I have a bad system where I separate into a "lead attorney" and a list of other attorneys. This works for the cases I started this with but with the subsequent data, its not a great system. I'm tempted to avoid having a bunch of tables that reference each other though. If only because it makes the performance worse and the data analysis harder. It depends on how you access things, but lookups might improve your performance. Names as strings are big. Ints are small. The associative table between Case and Attorney will scan much faster than the Case table. TimestampTZ64(FilingDate)+Int32(CaseID)+Int32(AttorneyID)+Int32(AttorneyType) = 160 bits = 20 Bytes. Postgres is decent at generating plans, so it should do a scan on Attorney, then an index lookup on the Case/Attorney association table, then a Partition/PK lookup on Case. In that case the only table scan is on the Attorney table, which can't possibly be that big. It'll probably be in memory most of the time. From there it uses indexes to avoid scanning the big table. It seems dumb and extra effort, but warehouses are the one place where bit size and all that jazz still matter, because a hundred bytes here or there times a few million or a billion rows adds up fast. When you're running that fuzzy logic search it's better to run it on an Attorney table with 10,000 rows than the Case table with 10,000,000.
|
# ? Jul 16, 2020 20:50 |
|
Xae posted:It depends on how you access things, but look ups might improve your performance. I'm worried that was the argument that led to this nearly unusable API layout: https://www.courtlistener.com/api/rest-info/ Take a look at that diagram and try to figure out how to find all of "Bob Loblaw's" cases in I love courtlistener and talk with the owner every now and then so I should pick his brain on this, but man their API is an absolute BITCH to search flexibly. An attorney table will likely end up on the order of 1M rows for a 10M row case table. This is because of challenges with attorney attribution between courts, states, etc. For reference, the count of CL's "attorneys" endpoint is 6.4M for a 17M count "dockets" end point. CarForumPoster fucked around with this message at 01:13 on Jul 17, 2020 |
# ? Jul 17, 2020 00:59 |
|
CarForumPoster posted:I'm worried that was the argument that led to this nearly unusable API layout: https://www.courtlistener.com/api/rest-info/ Take a look at that diagram and try to figure out how to find all of "Bob Loblaw's" cases in Yeah, that is a bit over designed for your purposes. Build the data model out a few ways. Use a data generator to fill out the tables and then start querying it. A lot of these questions can only be answered by experimenting.
|
# ? Jul 17, 2020 05:05 |
|
ok, question. i have a dataset i want to export into an online service. they have a thing where you upload a csv to them. problem is my dataset's 230mb, and their upload system only handles csv 10mb or smaller. is there a way of exporting a table from sql server that divides the resulting csv into particular sizes? like, is there some way i can say take the first 10k rows, put that in csv1, take the next 10k rows, put those in csv2, etc. until all rows have been exported? e: just thought of a very manual way using offset and fetch. that'll ultimately work, but is there a more automatic, less annoying way? abelwingnut fucked around with this message at 01:23 on Jul 20, 2020 |
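The manual OFFSET/FETCH approach from the edit would look roughly like this in T-SQL, re-run with the offset bumped by 10k each pass (table name invented):

```sql
SELECT *
FROM dbo.MyTable
ORDER BY Id        -- OFFSET/FETCH requires a stable ORDER BY
OFFSET 0 ROWS FETCH NEXT 10000 ROWS ONLY;   -- then OFFSET 10000, 20000, ...
```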
# ? Jul 20, 2020 01:00 |
|
What you probably want to do is make the big CSV, then use some tool (not the database) to split up that CSV. Better yet, write a little program that reads slightly less than 10MB of data into a buffer, posts it to the service, then repeats until EOF.
|
# ? Jul 20, 2020 02:20 |
|
yea, that makes sense. problem is i have no idea how to do that. would this be something easily done with python?
|
# ? Jul 20, 2020 02:40 |
|
abelwingnut posted:yea, that makes sense. problem is i have no idea how to do that. Yup, python would be quite suitable for the job.
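A rough Python sketch of the splitter (assumes plain rows with no embedded newlines and byte-ish ASCII sizing; for quoted multi-line fields you'd want the csv module instead of raw lines):

```python
def split_csv(src_path, out_prefix, max_bytes=10 * 1024 * 1024):
    """Split src_path into numbered CSV files, each under max_bytes,
    repeating the header row in every chunk. Returns the chunk file names.
    Assumes no single data row exceeds max_bytes on its own."""
    outputs = []
    out, size = None, 0
    with open(src_path, newline="") as src:
        header = src.readline()
        for line in src:
            # start a new chunk if this row would push us over the limit
            if out is None or size + len(line) > max_bytes:
                if out is not None:
                    out.close()
                name = f"{out_prefix}{len(outputs) + 1}.csv"
                out = open(name, "w", newline="")
                out.write(header)
                size = len(header)
                outputs.append(name)
            out.write(line)
            size += len(line)
    if out is not None:
        out.close()
    return outputs
```

Run it once against the exported CSV and upload the resulting `part1.csv`, `part2.csv`, ... files one at a time.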
|
# ? Jul 20, 2020 03:12 |
|
How do I avoid getting cartesian products in my results? I have parent_rec, child1, childA record types. A parent may have zero or one child1 type records and zero or one childA records, and I want to pivot these into a single flat record. by doing "select * from parent_rec left join child1 left join childA" I am getting the cartesian product back out, eg:
Instead I want just "parent1, child1, childA". I can't do an inner join since there may exist records of the type "parent1, null, childA" or "parent1, child1, null" that need to be retrieved. Paul MaudDib fucked around with this message at 19:02 on Jul 20, 2020 |
# ? Jul 20, 2020 18:58 |
|
Paul MaudDib posted:How do I avoid getting cartesian products in my results? Could you group by parent_rec then use aggregate functions (e.g. MAX()) to get your child data? If there's only at most 1 of each child type, MAX should either return the matching child value, or null if there is no child for that parent.
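Sketched with invented column names, that aggregate approach would be:

```sql
SELECT p.id,
       MAX(c1.value) AS child1_value,   -- NULL when no child1 row exists
       MAX(ca.value) AS childA_value
FROM parent_rec p
LEFT JOIN child1 c1 ON c1.parent_id = p.id
LEFT JOIN childA ca ON ca.parent_id = p.id
GROUP BY p.id;
```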
|
# ? Jul 20, 2020 19:19 |
|
Uh, what's your DB engine? On both MSSQL and Postgres, querying A left join B left join C will absolutely return exactly one record as long as neither B nor C have more than one of A's children. Are you doing the old-style joins, by any chance? I.e. instead of "from A left join B on A.id=B.id left join C on A.id=C.id", are you doing "from A, B, C where (something)"?
|
# ? Jul 20, 2020 19:43 |
|
What you describe isn't how outer joins work... http://sqlfiddle.com/#!17/4ac91/3 can you post your query in more detail and tell us what RDBMS you are using?
|
# ? Jul 20, 2020 19:52 |
|
NihilCredo posted:Uh, what's your DB engine? On both MSSQL and Postgres, querying A left join B left join C will absolutely return exactly one record as long as neither B nor C have more than one of A's children. Oracle 11g and I'm doing the "from A left join B on A.id=B.id left join C on A.id=C.id" syntax. I'm eliding some complexity here, we've got 4 tables, parent, many-to-many join-table, and two types of records attached via the join_table, so there are four joins here, parent to join-table, join-table to child1, parent to join-table, join-table to childA. I have nuked my test data but I'm pretty sure that's what was going on. The max thing might work though.
|
# ? Jul 20, 2020 19:57 |
|
There is much to hate about Oracle, but I'm pretty sure outer joins work the way they are supposed to. Do you have a minimal example of the behaviour? If all your joins go via a many-to-many table, that could be obfuscating things.
|
# ? Jul 20, 2020 20:00 |
|
|
|
yo what's up YOSPOS long time no see. I have to get some data out of a few weirdly designed tables in postgres. We process tips for hairdressers and whatnot. We have a Ledger table where we track positive and negative additions to someone's balance. There's also a Payout table, and a payout is us sending money out, and a payout covers several tips, for several people. Once someone has been paid out, we add negative balance to their Ledger and mark it with a payoutId to know which payout resulted in that person's balance going down. So far so good. I'm writing some analytics for that cause the db guy is not currently available. I have to get the average number of tips per payout for one person. Now you would imagine that there would be a M2M table or something that tells you which tips were paid out in which payout. But there isn't. Apparently, the way to get all the Ledger entries that were paid out in one payout (say payoutId = 99) is to find the Ledger entry with payoutId = 98 and then just go up by ids until we get to a Ledger entry with payoutId = 99. so I have code:
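One hedged way to write that id-range walk as a single query — column names like payout_id and amount are guesses from the description, not the actual schema:

```sql
-- Ledger entries belonging to payout 99: everything after the previous
-- payout's marker row, up to and including payout 99's own marker row.
SELECT COUNT(*) AS tips_in_payout
FROM ledger l
WHERE l.id <= (SELECT MAX(id) FROM ledger WHERE payout_id = 99)
  AND l.id >  (SELECT MAX(id) FROM ledger WHERE payout_id = 98)
  AND l.amount > 0;   -- count only the positive (tip) entries
```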
|
# ? Jul 24, 2020 15:19 |