|
Sometimes you're on a train and can't be bothered to whip out a keyboard. I do find it rather annoying that the shift key in Juno is a toggle instead of applying to a single letter.
|
# ¿ Apr 17, 2019 17:06 |
|
|
# ¿ May 11, 2024 12:05 |
|
Why are we discussing performance for a task that will take a couple of milliseconds a day?
|
# ¿ Apr 29, 2019 14:38 |
|
Oh boy, anecdotes! I taught myself Python basics in maybe 2011 with some Project Euler exercises. Started using it for work in 2013 with some data science-y tasks because the alternatives were Matlab and SAS and neither appealed to me. My 'IDE' has always been SublimeText 2 and its interpreter, so I never bothered with things outside the standard library. I wrote up a class that in hindsight feels like a stripped-down pandas because I didn't know any better. Last fall work got some of us a subscription to DataCamp and frankly it's been great. I'm more or less okay with learning from video and they include lots of exercises. Started using Jupyter, pandas and other extensions in the past month and I'm excited to transform a lot of my work products with it all.
|
# ¿ May 3, 2019 17:08 |
|
Stupid newbie matplotlib question that's frustrating the gently caress out of me. I have a time series with daily data going back several years. On my work computer I can plot it and it gives one tick per year on the x-axis like a sane plot would do, but I get a dreaded FutureWarning I don't understand: code:
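(The original code block didn't survive, but for what it's worth: one common FutureWarning when plotting daily pandas time series in that era came from matplotlib using pandas' datetime converters implicitly. A guess at the situation and the fix, with a made-up series standing in for the real data:)

```python
# Guess at the fix: older pandas versions emitted a FutureWarning when
# matplotlib picked up pandas' datetime converters implicitly; registering
# them explicitly silences it.
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import register_matplotlib_converters

register_matplotlib_converters()

# Hypothetical daily series spanning roughly four years
ts = pd.Series(range(1461),
               index=pd.date_range("2015-01-01", periods=1461, freq="D"))
fig, ax = plt.subplots()
ax.plot(ts.index, ts.values)  # x-axis ticks land on year boundaries
```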
|
# ¿ May 24, 2019 17:23 |
|
Turns out the reg field was read in as a string and not a date. I assumed pandas would read it back in as a date, since it was saved to CSV from pandas as a date, without my explicitly using parse_dates. Welp.
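A minimal sketch of the round-trip gotcha: CSV stores everything as text, so the dtype only comes back if you ask for it (the column name reg is taken from the post; the data is made up):

```python
import io
import pandas as pd

# Stand-in for a CSV that was written from a datetime column
csv = io.StringIO("reg,value\n2019-05-24,1\n2019-05-25,2\n")

df_str = pd.read_csv(csv)   # 'reg' comes back as object (string)
csv.seek(0)
df_date = pd.read_csv(csv, parse_dates=["reg"])  # 'reg' is datetime64
```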
|
# ¿ May 25, 2019 06:26 |
|
Deadite posted: I don’t know if this is the right place to ask, but I am trying to read a large (750mb) csv file into a pandas dataframe, and it seems to be taking an unreasonably long time. I am limiting the columns to only 8 columns with the usecols option, but the read_csv method is still taking 6 minutes to read the file into python.

Oh awesome, I'm in almost exactly the same situation you're in, except my datasets can reach 40GB of SAS. I've pared them down to maybe 11-15GB each, but I still have to convert to csv because I can't figure out how to make read_sas use chunks. I'll have to keep you in mind and bounce ideas with you.
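For anyone hitting the same wall, a sketch of chunked CSV reading to keep memory bounded (the column names and chunk size here are hypothetical; an in-memory buffer stands in for the big file):

```python
import io
import pandas as pd

# Tiny stand-in for the 750 MB / multi-GB CSV
csv = io.StringIO("a,b,c\n" + "\n".join(f"{i},{i * 2},{i * 3}"
                                        for i in range(10)))

total = 0
# chunksize yields an iterator of DataFrames instead of one huge frame
for chunk in pd.read_csv(csv, usecols=["a", "b"], chunksize=4):
    total += len(chunk)  # process each piece, then let it be collected
```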
|
# ¿ Sep 28, 2019 18:15 |
|
CarForumPoster posted: If you're loading them multiple times, pd.read_pickle() is really flexible and WAY WAY faster to load than a CSV. Almost as fast as a feather but I've had issues with feathers before where pickles "just work"

Hm, the [url=https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_pickle.html]docs[/url] for read_pickle don't really show much flexibility. Ideally I'd like something where I could read things in either in chunks or based on a filter, but pickle looks like it reads in the whole thing. Suppose I could do sqlalchemy, but while I know SQL I've always found Python implementations to be annoyingly verbose. HDF5 looks interesting but I couldn't wrap my head around it the last time I looked.
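A quick sketch of why pickle doesn't fit the chunk/filter use case: read_pickle deserializes the entire frame in one go, with no chunksize or where-style arguments (toy data, throwaway temp path):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"x": range(5)})
path = os.path.join(tempfile.mkdtemp(), "df.pkl")

df.to_pickle(path)
back = pd.read_pickle(path)  # always the whole frame; no chunks, no filters
```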
|
# ¿ Sep 30, 2019 01:34 |
|
Multilevel indices destroy my brain and I hates them. I like to convert stuff to tall format and operate that way, then pivot back to whatever form the output needs to be in. (I think that's called Tidy Data? Dunno)
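The wide-to-tall-and-back round trip could look something like this (column names are made up for illustration):

```python
import pandas as pd

wide = pd.DataFrame({"id": [1, 2], "2019": [10, 20], "2020": [30, 40]})

# melt: wide -> tall ("tidy"): one row per (id, year) observation
tall = wide.melt(id_vars="id", var_name="year", value_name="val")

# pivot: tall -> back to wide for the final output shape
back = tall.pivot(index="id", columns="year", values="val").reset_index()
```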
|
# ¿ Oct 22, 2019 23:29 |
|
I don't do a whole lot of OOP these days, so please bear with me here. I've made a thin wrapper API around DataFrame, let's call it SDataFrame. All's well and good until I want to call a method that returns a DataFrame, such as df.copy() or df.drop(). This of course returns a DataFrame, but I want it to return an SDataFrame instead. How can I do this? I tried various flavors of overriding the calls with super() but no dice.
|
# ¿ May 12, 2020 21:17 |
|
Protocol7 posted: I think you can pass a DataFrame into the data kwarg of the DataFrame constructor, so you might be able to just make a new instance of your subclass:

Yeah I've essentially been doing that, but I was wondering if there's a more transparent way to do it in the class definition itself. Ideally I'd like Python code:
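(In case it helps a later reader: pandas' subclassing docs describe a `_constructor` hook for exactly this — pandas calls it whenever an operation builds a new frame, so copy()/drop() etc. come back as the subclass. A minimal sketch:)

```python
import pandas as pd

class SDataFrame(pd.DataFrame):
    # pandas consults _constructor whenever an operation needs to build a
    # new frame, so derived frames keep the subclass type.
    @property
    def _constructor(self):
        return SDataFrame

sdf = SDataFrame({"a": [1, 2, 3]})
copied = sdf.copy()  # an SDataFrame, not a plain DataFrame
```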
|
# ¿ May 12, 2020 22:50 |
|
I have a 40 GB CSV, about 50 million records by 90-ish fields. I need to sort it by fields VIN and Date, and remove duplicated VINs, keeping only the most recent Date for each. My machine has 64GB. What's a sensible way to accomplish the task? After compacting categories in dask, saving it as parquet and re-reading it into a pandas df it's about 21GB, but it still can't be easily manipulated in pandas, and I know these are not dask-friendly parallelizable operations.
|
# ¿ Feb 19, 2021 09:20 |
|
Haha, fair enough. I've worked with SQL plenty but never worked with SQLite before, so maybe it's time to learn something new. For now, I just powered through with them df.sort_values and df.drop_duplicates commands, because gently caress RAM. Only took about an hour total and my computer didn't melt, so better than I expected, to be honest
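For anyone following along, the in-pandas route presumably looked something like this (toy data; VIN and Date are the column names from the earlier post):

```python
import pandas as pd

df = pd.DataFrame({
    "VIN": ["A", "A", "B"],
    "Date": pd.to_datetime(["2020-01-01", "2021-01-01", "2020-06-01"]),
})

# Sort ascending, then keep the last (most recent) row per VIN
dedup = (df.sort_values(["VIN", "Date"])
           .drop_duplicates("VIN", keep="last"))
```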
|
# ¿ Feb 19, 2021 10:12 |
|
I'm starting to get some real imposter syndrome when it comes to pandas and dask. Here's some simple code:code:
Maybe I just don't understand how the hell dask works. edit: never loving mind, I was writing to the wrong goddamned directory

Tayter Swift fucked around with this message at 21:56 on May 19, 2022 |
# ¿ May 19, 2022 21:48 |
|
In my case I'm porting some rather involved ETL on some large fixed-width format files (40GB, 50 million rows by 100+ columns) over from ancient SAS code. It's been pretty slow going as I'm feeling a bit over my head, and right now it's way the hell slower than the SAS code.
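Porting a SAS INPUT statement for fixed-width files might start from pandas' read_fwf; a sketch with a hypothetical layout (the column positions and names here are invented, standing in for the real ones):

```python
import io
import pandas as pd

# Tiny stand-in for the 40 GB fixed-width file; colspecs would come
# from the SAS INPUT statement's column positions.
data = io.StringIO("0012019ABC\n0022020XYZ\n")

df = pd.read_fwf(
    data,
    colspecs=[(0, 3), (3, 7), (7, 10)],   # (start, end) per field
    names=["id", "year", "code"],
    dtype={"code": "string"},             # keep codes from becoming numbers
)
```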
|
# ¿ May 20, 2022 18:53 |
|
I use Pandas for my ETL all the time, although I'm switching over to Polars now. They might want to know if you know how to do proper method chaining and convert datetimes. And they might be sick of seeing the Titanic dataset if that was your plan.
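The kind of method chaining plus datetime conversion meant here might look like this in pandas (made-up column names and data):

```python
import pandas as pd

raw = pd.DataFrame({
    "ts": ["2024-01-26 18:26", "2024-01-27 09:00"],
    "val": [1, 2],
})

# One chained pipeline: parse timestamps, filter, sort — no intermediates
out = (raw
       .assign(ts=lambda d: pd.to_datetime(d["ts"]))
       .query("val > 0")
       .sort_values("ts"))
```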
|
# ¿ Jan 26, 2024 18:26 |
|
Chin Strap posted: Better to never curate the API ever. And at least it has a pretty long deprecation time so you can catch warnings.

Meanwhile I'm embracing chaos and have switched over to polars. Every week there's a new deprecation to chase.
|
# ¿ Feb 27, 2024 18:01 |
|
Something I've kinda-sorta struggled with over the last couple of years is reconciling PEP8 and other best practices with the rise of notebooks. Does splitting sections into functions make sense when those sections are in individual cells? Do you avoid referencing results from previous cells when you don't have to, because of what can be a non-linear way of executing code (sometimes I'll just reread data from disk each cell if it's quick)? Docstrings are redundant when we have Markdown, right? What about line lengths?
|
# ¿ Feb 27, 2024 23:40 |
|
idk, I like the concept of notebooks having rich-text documentation alongside the code, and being able to show a process step-by-step. I do wish there were a way to prevent rerunning previous sections of code (apart from an import cell), or to prevent altering a variable in a cell, changing the cell, and altering the variable again.
|
# ¿ Feb 28, 2024 00:29 |