|
Hughmoris posted:Question for those who use Postgres, before I go off and make a terrible tool. Is there anything that will read a CSV or Parquet file and auto-generate a .pgsql script that will create the tables and schemas for said CSV, so I can import it? Right now, for learning purposes, I'm having to type out all the columns and types of said CSV before I can copy it with psql. It gets a bit tiring with multiple CSVs, or lots of columns. DBeaver
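If you do end up rolling your own, the core of such a tool is small. A rough sketch (Python, assuming a header row and only integer/numeric/text columns; column names are emitted unquoted, so a real version would need to sanitize them):

```python
import csv

def infer_type(values):
    """Guess a Postgres column type from sample string values."""
    def all_match(cast):
        for v in values:
            if v == "":
                continue  # treat empty cells as NULL, not as type evidence
            try:
                cast(v)
            except ValueError:
                return False
        return True
    if all_match(int):
        return "integer"
    if all_match(float):
        return "numeric"
    return "text"

def csv_to_create_table(path, table):
    """Read a CSV with a header row and emit a CREATE TABLE statement."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    cols = []
    for i, name in enumerate(header):
        col_type = infer_type([r[i] for r in data])
        cols.append(f"    {name} {col_type}")
    return f"CREATE TABLE {table} (\n" + ",\n".join(cols) + "\n);"
```

After generating the DDL you'd still load the data itself with psql's \copy, as before.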
|
# ? Aug 19, 2022 12:09 |
|
|
# ? May 11, 2024 07:54 |
|
drat I am megadumb. Thanks.
|
# ? Aug 19, 2022 12:20 |
|
I have to create a database for my job, but I'm unsure of the right way to design it. We have a bunch of different datasets with timeseries data and we want to gather them all together. I'm thinking of having either one big table with these columns:

dataset     location  time  energy_use
uk          ...       ...   xx
australia   ...       ...   yy

Or a table listing all the datasets, and then a separate table dataset_table for each dataset with:

location  time  energy_use
xx        ...   yy

The second solution seems more elegant, but I can't figure out how to connect the dataset table to the different tables.

edit: the goal is to be able to select only the data from a selected dataset. Maybe I can have a table BLA with:

dataset     dataset_table
uk          uk_table
australia   australia_table

and do select * from (select dataset_table from BLA where dataset is uk)

uguu fucked around with this message at 00:11 on Aug 24, 2022 |
# ? Aug 23, 2022 23:58 |
How large are those datasets, how many are there, and which DBMS will you be using?
|
|
# ? Aug 24, 2022 00:09 |
|
We'll be starting with about a dozen datasets, they are a couple of gigabytes each, and we're using TimescaleDB, which is built on PostgreSQL. The goal is to create an open-source dataset of building energy usage, which researchers can access and eventually add to. I'm thinking I can have a table BLA with:

dataset     dataset_table
uk          uk_table
australia   australia_table

and do select * from (select dataset_table from BLA where dataset is uk)
|
# ? Aug 24, 2022 00:15 |
That isn't valid SQL syntax. If the datasets are all homogeneous they should probably go into one table. If you want to separate their storage you can maybe use table partitioning to split them up by dataset id.
|
|
# ? Aug 24, 2022 00:30 |
|
They're not homogeneous unfortunately, some come with additional columns. Also each dataset has auxiliary tables describing e.g. the weather for each point in time or building characteristics. And those especially are heterogeneous. Is there no way to link two tables, except by joining on identical columns? I could do the table selecting in the API, I guess? Have the user select the UK dataset, then do select dataset_table from BLA, have that return uk_table and then create a query with select * from uk_table.
|
# ? Aug 24, 2022 01:12 |
|
Does a "dataset" contain more than one "location"? A good place to start is the PostgreSQL introduction/tutorial. The way to "link tables together" in SQL is with foreign keys. You need to understand whether the relationships are one-to-one or one-to-many; many people start with ERDs (entity relationship diagrams) to see how things will "connect". The additional data must be dealt with also. Rows of a table contain the same columns, so variable interpretation of rows is not advised. If the extra data is sometimes just being retrieved, but never a part of a query/filter, then a common pattern is to stuff it into a JSON field.
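That JSON-field pattern can be sketched like this (SQLite via Python purely for illustration; the table and column names are invented, and in Postgres the extra column would be json/jsonb rather than serialized text):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE readings (
        dataset    TEXT,
        location   TEXT,
        time       TEXT,
        energy_use REAL,
        extra      TEXT   -- dataset-specific attributes, serialized as JSON
    )
""")

# A UK row with an extra weather attribute; an Australian row without one.
conn.execute(
    "INSERT INTO readings VALUES (?, ?, ?, ?, ?)",
    ("uk", "london", "2022-01-01T00:00", 12.5,
     json.dumps({"outdoor_temp_c": 4.2})),
)
conn.execute(
    "INSERT INTO readings VALUES (?, ?, ?, ?, ?)",
    ("australia", "sydney", "2022-01-01T00:00", 9.1, None),
)

# The shared columns stay queryable; the extras are decoded on retrieval.
for dataset, extra in conn.execute("SELECT dataset, extra FROM readings"):
    attrs = json.loads(extra) if extra else {}
    print(dataset, attrs.get("outdoor_temp_c"))
```

The trade-off is exactly as described above: anything in the JSON blob can't participate in indexes or filters as cheaply as a real column (Postgres jsonb narrows that gap somewhat).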
|
# ? Aug 24, 2022 01:43 |
|
uguu posted:They're not homogeneous unfortunately, some come with additional columns. Also each dataset has auxiliary tables describing e.g. the weather for each point in time or building characteristics. And those especially are heterogeneous.

Seems to me you're approaching the problem from the wrong end, or at least from an angle that makes a relational database a poor fit for the task. If the goal is to be able to compare time series data from different data sets, the very first question you must ask yourself is what is actually important to compare. Then you design your database for comparing those common data points and figure out how to ingest the source data in a way that makes an apples-to-apples comparison possible. Only then can you start considering how to store and display any additional data.

To me it seems, though, that the bigger issue here is that this might be a fundamentally ill-posed problem. Trying to compare datasets of different origins with seemingly no standard methodology for data gathering is fraught with peril and difficult even with a lot of manual fudging. It might honestly be better to do the minimum possible processing of the data into a sort of minimum structured format, toss all of the details into a JSON blob, and leave it up to whatever researcher is using the data to interpret them.
|
# ? Aug 24, 2022 02:39 |
|
Those are good points. I considered InfluxDB, then I could just throw everything into a bucket and wash my hands of it. But as little as I know of SQL, I know nothing of Flux, and the manual wasn't very helpful. I will think about this a bunch more. Thanks for the advice!
|
# ? Aug 25, 2022 16:45 |
|
I mostly program Arduino stuff. I'd put myself at a level of "Advanced" I guess... I use Visual Studio with a plugin to make things easier, and I have written (very basic) libraries and stuff before. I mostly program solo. I'm self-taught, and don't really have a set of coding peers to ask questions like this.

I'd like to start using GitHub to back up my stuff more often and be able to revert when I mess up hard. I already have an account, and I've messed around with BitBucket and SourceTree a little before, but I've never really stuck with it very well. I like the GUI, but for the most part, I really just want an automated setup that I can ignore until something goes wrong. Is that unreasonable or stupid? Should I actively be more hands-on?

Right now, my file organizing is basically a main programming folder, with a series of sub-folders organized by job. Then, each job has sub-folders for individual programs/projects. After a job has been completed for at least a year, I throw the whole job folder into an Archive folder in that main programming parent directory.

In my head, the ideal is that at the end of an editing session, when I close the file out, it uploads those changes automatically, with the option to push an update if I am doing a huge change that I'm unsure of. I don't particularly want to deal with branches, etc., as again, this is mostly solo development, and I rarely am doing experiments on two sections of code at once. Is this a good way to do it? What's the best way to implement this?
|
# ? Aug 26, 2022 17:12 |
|
Fanged Lawn Wormy posted:I mostly program Arduino stuff. I'd put myself at a level of "Advanced" I guess... I use VisualStudio with a plugin to make things easier, and I have written (very basic) libraries and stuff before. This doesn't have much to do with databases. Did you mean to post this in the general questions thread? The question you're asking here kind of straddles two things: version control and backing up your personal files. They are separate concerns, although to the extent that you can use a version control tool like Git to distribute changes to multiple machines or online services, it does offer some level of backup-like utility. If you are undertaking programming projects of any complexity then you owe it to yourself to use version control. It is worth the up-front effort to learn, and especially if you are working on your own there isn't really that much to learn. I don't think automatically submitting to version control when you close the editor is a good idea. You should be making a decision about when to do it, and it needs to make sense if you later come back to look at the history of what you did. "Finished adding the new button" makes sense, "it was 8pm and I had to put Timmy to bed" doesn't. It might sound like overhead but once you are used to it, it won't seem like that.
|
# ? Aug 26, 2022 18:17 |
|
edit: I truly am an idiot, wrong thread
BlackMK4 fucked around with this message at 21:29 on Aug 26, 2022 |
# ? Aug 26, 2022 21:02 |
|
Hammerite posted:This doesn't have much to do with databases. Did you mean to post this in the general questions thread? Hm. Thanks. I'll do some work and get it figured out. Thanks for the help even if it was the wrong thread!
|
# ? Aug 29, 2022 15:56 |
|
I'm sure this will be kindergarten-level for most of you, but I'm trying to get the lowest salary, along with the department: https://www.db-fiddle.com/f/4EuiV3Pgbd7BTzRR4BKxZC/20 My impression was that if I don't put a GROUP BY, then it should give me a single result with the lowest overall salary, am I mistaken? If I don't put it, I get an error complaining about the lack of a GROUP BY.
|
# ? Aug 29, 2022 17:02 |
No, you can't get the smallest like that. Order by the salary ascending (smallest first), and fetch just the first row.
|
|
# ? Aug 29, 2022 17:06 |
|
Would something like this work for you? https://www.db-fiddle.com/f/7GDN7hqs4aE3s2qvCVvYR4/1 Hughmoris fucked around with this message at 17:17 on Aug 29, 2022 |
# ? Aug 29, 2022 17:12 |
|
nielsm posted:No, you can't get the smallest like that. I will try that, thank you! Hughmoris posted:Would something like this work for you? I'm just seeing the same thing that I posted
|
# ? Aug 29, 2022 17:15 |
|
Seventh Arrow posted:I will try that, thank you! Try this one: https://www.db-fiddle.com/f/7GDN7hqs4aE3s2qvCVvYR4/1
|
# ? Aug 29, 2022 17:17 |
|
That works, thanks! I would not have thought of using a subquery, very interesting.
|
# ? Aug 29, 2022 17:19 |
|
Seventh Arrow posted:I'm sure this will be kindergarten-level for most of you, but I'm trying to get the lowest salary, along with the department:

When using any sort of summarization function, imagine the individual rows you're considering to be sheets of paper and your summarization to be a folder those papers are in. In this example, your names & salaries are all individual sheets of paper and your MIN function is your folder. As soon as you try to calculate that function, imagine writing the answer on the outside of that folder and closing it shut - now you only have access to what you wrote on the folder and cannot look at any of the details inside it.

This is why your query fails, as you're trying to return both aggregated information and non-aggregated information. Your query will run just fine if the only thing you try to select is the minimum salary without the details. If you need to get the department along with the salary, you can use a subquery, select TOP 1 and order by the salary, or you can look into CTEs for more advanced stuff.
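Both suggestions from the thread can be seen side by side in a small sketch (SQLite via Python; the table and data are invented, and SQLite uses LIMIT 1 where SQL Server would use TOP 1):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("ann", "sales", 52000), ("bob", "ops", 43000), ("cam", "ops", 61000)],
)

# Aggregate-only query: fine, because nothing non-aggregated is selected.
lowest = conn.execute("SELECT MIN(salary) FROM employees").fetchone()[0]

# To get the department too, filter with a subquery...
row_subquery = conn.execute(
    "SELECT department, salary FROM employees"
    " WHERE salary = (SELECT MIN(salary) FROM employees)"
).fetchone()

# ...or order by salary and take just the first row.
row_order = conn.execute(
    "SELECT department, salary FROM employees ORDER BY salary ASC LIMIT 1"
).fetchone()
```

Note the two approaches differ on ties: the subquery form returns every row at the minimum salary, while LIMIT 1 picks one arbitrarily.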
|
# ? Aug 29, 2022 17:22 |
|
kumba posted:When using any sort of summarization function, imagine the individual rows you're considering to be sheets of paper and your summarization to be a folder those papers are in. In this example, your names & salaries are all individual sheets of paper and your MIN function is your folder. This is a good analogy, thanks! I will mull it over some.
|
# ? Aug 29, 2022 17:31 |
|
With Postgres ltree, is there a pattern that allows you to store normalized data (no duplicate values) along trunk/leaf nodes, with lower nodes "overwriting" values higher in the tree? So for example if ltree address "a" has a column val:1, "a.b" has null, and "a.b.c" has val:3, I basically want a way to do coalesce(3, null, 1). Is that built in somewhere, or what would be an efficient way to do that? Paul MaudDib fucked around with this message at 17:48 on Sep 6, 2022 |
# ? Sep 6, 2022 17:44 |
|
That seems like an unusual use of ltree (which is itself somewhat obscure). I don't think ltree will enforce distinct values, my only guess there would be maybe try a unique index, not sure how it would actually work though. But also this doesn't seem particularly tree-structured to me. What if a.b.d also exists, or a.e? Which value takes precedence? If this case will never occur, then could you get away with just using an array and taking the first or last element of the array, depending on how you store it?
|
# ? Sep 6, 2022 21:58 |
|
No, like I want to select leaf nodes in a tree, such that if you walk the tree from the root to the leaves, column values from parents are overridden by children values (if non null) as you walk down the tree. Which I think is the answer, select leaf nodes, inner join a CTE that builds a parent row for each intermediate lpath on the tree. Which, of course, is a prime candidate for a materialized view I’d think. Does Postgres do partially/dynamically rematerialized views these days?
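The coalesce-down-the-path logic itself is simple; a Python sketch of what the CTE would compute (the paths and values are made up — in SQL you'd generate the ancestor paths of each leaf, e.g. with ltree's subpath(), and coalesce across them from leaf to root):

```python
def effective_value(values_by_path, leaf_path):
    """Walk from the leaf up toward the root, returning the first
    non-null value found -- i.e. children override ancestors."""
    parts = leaf_path.split(".")
    # e.g. "a.b.c" -> check "a.b.c", then "a.b", then "a"
    for depth in range(len(parts), 0, -1):
        prefix = ".".join(parts[:depth])
        v = values_by_path.get(prefix)
        if v is not None:
            return v
    return None

# val is 1 at "a", null at "a.b", 3 at "a.b.c"
tree = {"a": 1, "a.b": None, "a.b.c": 3}
```

As for materialized views: stock Postgres still only supports full refresh (REFRESH MATERIALIZED VIEW, optionally CONCURRENTLY); incremental/partial rematerialization needs an extension or hand-rolled triggers.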
|
# ? Sep 7, 2022 07:09 |
|
I have an Oracle 19 database (hosted on AWS as an RDS Oracle instance) with a table into which I am trying to load data from a Windows box using sqlldr.exe. The table has five columns: code:
code:
code:
I've tried changing the nls_date_format to YYYY-MM-DD on the host using a parameter file per AWS instructions but that didn't change anything. I'm looking at the log file and it has this entry: code:
Agrikk fucked around with this message at 18:56 on Sep 7, 2022 |
# ? Sep 7, 2022 18:23 |
|
It was the SQL*Loader control file. code:
|
# ? Sep 7, 2022 20:40 |
|
Not got a db to hand, but you were also falling into the trap of Oracle Date not being Timestamp. That data would go into a Date field just fine because they go to the second, timestamp is for sub-second and/or timezone data. This means there's also nls_timestamp_format, for setting what that should look like.
|
# ? Sep 8, 2022 17:03 |
|
I'm trying to work out a SQL problem from a supposed Reddit interview question: https://www.interviewquery.com/interview-guides/reddit-data-analyst (Ad comments) I tried hammering out a solution by populating the tables with some fake data (which is probably wrong): https://www.db-fiddle.com/f/3frdKCCacKFCUUYiL9dJj7/12 but now I'm not even sure I can make sense of the question. The idea is that you're supposed to calculate a percentage of comments for each ad in each of the tables. But isn't there going to be a one-to-one relationship between the amount of comments on given ad and the instances of the ad ID? In other words, every time someone comments on "polo shirts", then surely the ad ID for "polo shirts" is going to be there too, making the ratio 100%(?) It's a bit tricky because they don't provide a sampling of data, just the schemas. In completely unrelated news, I have a data engineer interview with Reddit this week
|
# ? Sep 11, 2022 20:06 |
|
Seventh Arrow posted:I'm trying to work out a SQL problem from a supposed Reddit interview question: https://www.interviewquery.com/interview-guides/reddit-data-analyst (Ad comments) I think what they want is that if there are 7 "feed" comments and 3 "moments" comments then you have to say it's 70% "feed" and 30% "moment". You can see that in their example results, the rows sum to 1 (they are proportions rather than percentages as the column titles suggest).
|
# ? Sep 11, 2022 20:21 |
|
Yes, I was kind of wondering about that, since there only seemed to be one "instance" of each ad. How would you calculate the percentage, then? Just assume that "ad" is "10" and do something like "SELECT COUNT comment_id/10 * 100..."?
|
# ? Sep 11, 2022 20:32 |
|
Seventh Arrow posted:Yes, I was kind of wondering about that, since there only seemed to be one "instance" of each ad. How would you calculate the percentage, then? Just assume that "ad" is "10" and do something like "SELECT COUNT comment_id/10 * 100..."?

I don't know what you mean by "one instance". I don't know what the most performant way to answer the question is, but I would:

1. ignore the "ads" table entirely*
2. do a SELECT COUNT on the "feed_comments" table with GROUP BY (preferably this is a CTE, but it could be a subquery in the FROM clause; it doesn't matter)
3. do the same thing for the "moments_comments" table
4. do a FULL OUTER JOIN with COALESCE(..., 0) to combine the counts in one result set
5. do the maths to get the final result set in the form they've asked for

* the question arises of what they want to happen in the case where an ad has no comments in either table. I'm choosing to ignore those ads because the problem statement on that web page doesn't tell you what to do with them. If you were posed the question in an interview then of course you could ask the interviewer, and hopefully that would be received positively because it would indicate you had a good grasp of the details.

e: hello Cloudflare
|
# ? Sep 11, 2022 22:11 |
|
The ads table is the source for ads that exist, which can be larger than the ads that appear in either comments table. Were I the grader, I'd deduct for an outer join because it only addresses the situation where the ad has at least one type of comment. Of course asking the question is valid, but all of no-comments, one-type, and both-types can be handled with a single idiom without an outer join. When I opened the page I jumped right to the hard problem.
|
# ? Sep 11, 2022 23:05 |
|
PhantomOfTheCopier posted:The ads table is the source for ads that exist, which can be larger than the ads that appear in either comments table. Were I the grader, I'd deduct for an outer join because it only addresses the situation where the ad has at least one type of comment. Well then, what percentage would you report for each type of comment in the case where there are no comments? Zero out of the zero comments are "feed" comments, what proportion or percentage do you deem that to be? There isn't an unambiguously mathematically correct answer, so absent any input from the problem setter on what behaviour they would want to see, any approach to those ads is reasonable.
|
# ? Sep 11, 2022 23:11 |
|
I'd ask if the same comment_id can appear both in feed_comments and moments_comments. It probably can't, but it's possible they put that in there as a trap. A lot of interviewers really love to feel clever. Tables with an inherited PK are a gently caress.
|
# ? Sep 12, 2022 14:44 |
|
Yeah I admittedly usually bomb these SQL/Python interview tests but I'm not getting to work with this stuff in my current job anymore, so I might as well try. Also, whenever possible I try to copy/paste the interview challenge and analyze it later so I can get better at this stuff.
|
# ? Sep 12, 2022 14:55 |
|
Hammerite posted:Well then, what percentage would you report for each type of comment in the case where there are no comments? Zero out of the zero comments are "feed" comments, what proportion or percentage do you deem that to be? There isn't an unambiguously mathematically correct answer, so absent any input from the problem setter on what behaviour they would want to see, any approach to those ads is reasonable. B refs A, C refs A... "I'll do a query in B to get counts, and join that with the counts from C!" what if a comment is in B but not C? Oh well... I guess I can do a left join of B and C. But what if it's in C and not B? I guess I need a right join, well then a full outer join of B and C... and still a coalesce to handle the nulls. And now we need to add C refs A. I guess more joins and maybe a case statement for the combinations that are missing. But what if it's in neither and we still want those rows? Another outer join! I can just dump it in Excel and... Can you use A to help? ... Well if I (do this instead) it addresses that issue and all of the above variations, I should have done that in the first place. Does it take joins? Sure, but join against the list with primary key... because you just know the next question will be "Durr can we get the A.sale-price too since that might tell us something useful?" and the answer need not be "let me rewrite the entire query".
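The "start from the ads table" idiom can be sketched like this (SQLite via Python; the table names follow the problem statement but the columns and sample data are invented). No-comment ads come back with NULL proportions rather than a made-up number, which sidesteps the zero-out-of-zero question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ads (ad_id INTEGER PRIMARY KEY);
    CREATE TABLE feed_comments (comment_id INTEGER, ad_id INTEGER);
    CREATE TABLE moments_comments (comment_id INTEGER, ad_id INTEGER);
    INSERT INTO ads VALUES (1), (2), (3);        -- ad 3 has no comments
    INSERT INTO feed_comments VALUES (10, 1), (11, 1), (12, 2);
    INSERT INTO moments_comments VALUES (20, 1), (21, 2), (22, 2), (23, 2);
""")

# Drive the query from ads, so every ad appears exactly once, with
# correlated subqueries instead of any outer join between the two
# comment tables.  NULLIF avoids division by zero for comment-less ads.
rows = conn.execute("""
    SELECT ad_id,
           1.0 * feed_n    / NULLIF(feed_n + moments_n, 0) AS feed_pct,
           1.0 * moments_n / NULLIF(feed_n + moments_n, 0) AS moments_pct
    FROM (
        SELECT a.ad_id,
               (SELECT COUNT(*) FROM feed_comments f
                 WHERE f.ad_id = a.ad_id) AS feed_n,
               (SELECT COUNT(*) FROM moments_comments m
                 WHERE m.ad_id = a.ad_id) AS moments_n
        FROM ads a
    )
    ORDER BY ad_id
""").fetchall()
```

Adding another column from ads (say, a sale price) is then just one more item in the outer select list, with no query restructuring.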
|
# ? Sep 12, 2022 22:15 |
|
uguu posted:Those are good points. Not sure if you made any progress on this, but I did a similar thing at my old work - we had about 30 years worth of 3-waters related scada measurements for various sites around the city at 5min intervals. I don't work there anymore and it was a couple of years ago, but basically we had a bunch of ETL processes that collated the data from the various sources and ultimately inserted into postgres/timescaledb, then we used PostgREST to surface a proc using timescaledb's hyperfunctions that would pull the detail data plus a couple of other procs that could be used to surface summary data to help finding the devices of interest. The data looked like: device id/location/time/value/comment, so not wide, just shitloads of it. We surfaced PostgREST through an api gateway that let us limit access via api keys, rate limit, have nice looking url paths and all that good stuff. It took a bit of thought to put it all together, as we were new to postgres and as well as timescaledb, but once it was set up it was pretty good, I really liked the idea of PostgREST and not needing to roll a custom api/db interface.
|
# ? Sep 12, 2022 23:48 |
|
Does anyone know if it's possible to get Multicorn (the Postgres plugin that lets you write Foreign Data Wrappers in Python) working in Azure's Postgres service? We have a giant legacy OSIsoft PI thing (big ugly enterprisey timeseries database thing) that we all hate using but it's kinda core to our operation. We're toying with ideas for how to cut that expensive crackpipe out, and as a first step, accessing it via an FDW means we can later on just move it to a local TimescaleDB table instance. It might just be a hare-brained scheme, but we'd like to try it.
|
# ? Sep 13, 2022 07:05 |
|
|
Most cloud databases prohibit installation into the backends with that level of access. Azure still seems to restrict outbound connections so it's not possible directly. It will need to be custom based on your data and read/write model. Most migrations handle this with some application support, at least fallback logic with reads from the old system and writes to the new system. If zero downtime is mandated there are methods up to the limits of the CAP theorem.
|
# ? Sep 13, 2022 07:56 |