Tayter Swift
Nov 18, 2002

Pillbug
Sometimes you're on a train and can't be bothered to whip out a keyboard :shrug:

I do find it rather annoying that the shift key in Juno acts as a toggle instead of applying to a single letter.


Tayter Swift
Nov 18, 2002

Pillbug
Why are we discussing performance for a task that will take a couple milliseconds a day :confused:

Tayter Swift
Nov 18, 2002

Pillbug
Oh boy, anecdotes!

I taught myself Python basics in maybe 2011 with some Project Euler exercises. Started using it for work in 2013 with some data science-y tasks because the alternatives were Matlab and SAS and neither appealed to me. My 'IDE' has always been SublimeText 2 and its interpreter, so I never bothered with things outside the standard library. I wrote up a class that in hindsight feels like a stripped-down pandas because I didn't know any better.

Last fall work got some of us a subscription to DataCamp and frankly it's been great. I'm more or less okay with learning from video and they include lots of exercises. Started using Jupyter, pandas and other extensions in the past month and I'm excited to transform a lot of my work products with it all.

Tayter Swift
Nov 18, 2002

Pillbug
Stupid newbie matplotlib question that's frustrating the gently caress out of me.

I have a time series with daily data going back several years. On my work computer I can plot it and it gives one tick per year on the x-axis like a sane plot would do, but I get a dreaded FutureWarning I don't understand:

code:
FutureWarning: Using an implicitly registered datetime 
converter for a matplotlib plotting method. The converter was 
registered by pandas on import. Future versions of pandas 
will require you to explicitly register matplotlib converters.

To register the converters:
	>>> from pandas.plotting import register_matplotlib_converters
	>>> register_matplotlib_converters()
  warnings.warn(msg, FutureWarning)
At home, however, it's trying to label every day, resulting in an unreadable mess for the x-axis (see screenshot below). How do I fix this so that there is only one tick and label per year?
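
For reference, I'd have expected to be able to force yearly ticks explicitly with matplotlib's date locators even when the defaults misbehave. A minimal sketch, assuming the index really is datetimes (df and the 'value' column are stand-ins):

Python code:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

fig, ax = plt.subplots()
ax.plot(df.index, df['value'])

# force one major tick per year, labeled with the four-digit year
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
plt.show()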


Tayter Swift
Nov 18, 2002

Pillbug
Turns out the reg field was read in as a string and not a date :doh:

I assumed that, since the column was a date when pandas saved the CSV, pandas would read it back in as a date without explicitly using parse_dates. Welp.
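
The fix is just telling read_csv up front which columns are dates; 'reg' is my column name, and data.csv is a stand-in:

Python code:
import pandas as pd

# parse the 'reg' column as datetimes at read time
df = pd.read_csv('data.csv', parse_dates=['reg'])

# or, fixing it after the fact
df['reg'] = pd.to_datetime(df['reg'])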

Tayter Swift
Nov 18, 2002

Pillbug

Deadite posted:

I don’t know if this is the right place to ask, but I am trying to read a large (750 MB) CSV file into a pandas dataframe, and it seems to be taking an unreasonably long time. I am limiting it to only 8 columns with the usecols option, but the read_csv method is still taking 6 minutes to read the file into Python.

I haven’t been using Python for very long and I’m coming from a SAS programming background. In SAS this file loads in a few seconds, so I feel like I am screwing something up for this to take so long. I originally tried the read_sas method to load the original 1.5 GB dataset, but I had a memory error and had to convert the file to CSV to get around that. The file only has 170k rows.

Oh awesome, I'm in almost exactly the same situation you're in, except my datasets can reach 40 GB of SAS. I've pared them down to maybe 11-15 GB each, but I still have to convert to CSV because I can't figure out how to make read_sas use chunks. I'll have to keep you in mind and bounce ideas off you.
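
Actually, it looks like read_sas does take a chunksize after all, which would skip the CSV detour entirely. A sketch with placeholder file and column names:

Python code:
import pandas as pd

# with chunksize set, read_sas returns an iterator of DataFrames
chunks = pd.read_sas('big_dataset.sas7bdat', chunksize=100_000)

# keep only the columns you need from each chunk, then glue them together
pieces = [chunk[['col_a', 'col_b']] for chunk in chunks]
df = pd.concat(pieces, ignore_index=True)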

Tayter Swift
Nov 18, 2002

Pillbug

CarForumPoster posted:

If you're loading them multiple times, pd.read_pickle() is really flexible and WAY WAY faster to load than a CSV. Almost as fast as feather, but I've had issues with feather before where pickles "just work".

Hm, the [url=https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_pickle.html]docs[/url] for read_pickle don't really show much flexibility. Ideally I'd like something where I could read data in either in chunks or based on a filter, but pickle looks like it reads in the whole thing.

Suppose I could do sqlalchemy, but while I know SQL, I've always found the Python implementations annoyingly verbose. HDF5 looks interesting, but I couldn't wrap my head around it the last time I looked.
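
That said, HDF5 through pandas might actually cover both the chunked and the filtered reads, as long as the file is written in table format. A rough sketch (the 'date' column and process() are placeholders):

Python code:
import pandas as pd

# write once in queryable table format
df.to_hdf('store.h5', key='data', format='table', data_columns=['date'])

# read back only the rows matching a filter...
recent = pd.read_hdf('store.h5', 'data', where='date >= "2018-01-01"')

# ...or stream the whole thing in chunks
for chunk in pd.read_hdf('store.h5', 'data', chunksize=100_000):
    process(chunk)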

Tayter Swift
Nov 18, 2002

Pillbug
Multilevel indices destroy my brain and I hates them. I like to convert stuff to tall format and operate that way, then pivot back to whatever form the output needs to be in. (I think that's called Tidy Data? Dunno)
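
The round trip I mean, roughly (column names are made up):

Python code:
import pandas as pd

# wide -> tall ("tidy"): one row per (id, variable, value)
tall = df.melt(id_vars=['id'], var_name='measure', value_name='value')

# do the work on the tall frame
summary = tall.groupby(['id', 'measure'])['value'].mean().reset_index()

# tall -> wide again for the final output
wide = summary.pivot(index='id', columns='measure', values='value')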

Tayter Swift
Nov 18, 2002

Pillbug
I don't do a whole lot of OOP these days, so please bear with me here.

I've made a thin wrapper API around DataFrame; let's call it SDataFrame. All's well and good until I want to call a method that returns a DataFrame, such as df.copy() or df.drop(). This of course returns a DataFrame, but I want it to return an SDataFrame instead. How can I do this? I tried various flavors of overriding the calls with super(), but no dice.

Tayter Swift
Nov 18, 2002

Pillbug

Protocol7 posted:

I think you can pass a DataFrame into the data kwarg of the DataFrame constructor, so you might be able to just make a new instance of your subclass:

Python code:
df = SDataFrame()

# ... do stuff to df

df_as_SDataFrame = SDataFrame(data=df.copy())

Yeah I've essentially been doing that, but I was wondering if there's a more transparent way to do it in the class definition itself. Ideally I'd like
Python code:
df = SDataFrame()
df2 = df.copy()
to just go ahead and create an SDataFrame instance, so I'm not having to keep track of which methods I need to wrap.
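
For posterity: pandas' subclassing docs point at overriding the _constructor property, which looks like exactly this. A minimal sketch, assuming SDataFrame subclasses DataFrame:

Python code:
import pandas as pd

class SDataFrame(pd.DataFrame):
    # pandas calls _constructor whenever a method builds a new frame,
    # so copy(), drop(), etc. come back as SDataFrame
    @property
    def _constructor(self):
        return SDataFrame

df = SDataFrame({'a': [1, 2, 3]})
print(type(df.copy()))  # <class '__main__.SDataFrame'>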

Tayter Swift
Nov 18, 2002

Pillbug
I have a 40 GB CSV, about 50 million records by 90-ish fields. I need to sort it by the VIN and Date fields and drop duplicate VINs, keeping the most recent Date. My machine has 64 GB of RAM.

What's a sensible way to accomplish this? After compacting categories in dask, saving it as parquet, and re-reading it into a pandas df it's about 21 GB, but it still can't be easily manipulated in pandas, and I know these aren't dask-friendly parallelizable operations.

Tayter Swift
Nov 18, 2002

Pillbug
Haha, fair enough. I've worked with SQL plenty but never worked with SQLite before, so maybe it's time to learn something new.

For now, I just powered through with them df.sort_values and df.drop_duplicates commands, because gently caress RAM. Only took about an hour total and my computer didn't melt, so better than I expected to be honest :shrug:
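
For the record, the whole dedupe boiled down to this (keep='last' relies on the ascending sort):

Python code:
# the newest Date for each VIN sorts last, so keep that row
df = df.sort_values(['VIN', 'Date'])
df = df.drop_duplicates(subset='VIN', keep='last')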

Tayter Swift
Nov 18, 2002

Pillbug
I'm starting to get some real imposter syndrome when it comes to pandas and dask. Here's some simple code:

code:
import dask.distributed

print('writing parquet')

# ship the dask dataframe to the workers, then run the write as a task
big_future = client.scatter(ddf)

def doit(frame):
    frame.to_parquet(path='raw 21q4/', schema='infer')

future = client.submit(doit, big_future)
dask.distributed.wait(future)
dask was yelling at me about the task graph when I was simply writing ddf.to_parquet(), so okay, I'm doing it as a scatter like it says. I expect this code to create a parquet file series in the specified path. Instead it churns forever and deposits nothing. The dask dashboard shows that it's clearly working during this time. Is there something in this code that makes it just skip the actual "write the goddamn parquet" step?

Maybe I just don't understand how the hell dask works.

edit: never loving mind, I was writing to the wrong goddamned directory :ughh:

Tayter Swift fucked around with this message at 21:56 on May 19, 2022

Tayter Swift
Nov 18, 2002

Pillbug
In my case I'm porting some rather involved ETL on some large fixed-width-format files (40 GB, 50 million rows by 100+ columns) over from ancient SAS code. It's been pretty slow going, as I'm feeling a bit over my head, and right now it's way the hell slower than the SAS code.
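
For the fixed-width part I'm leaning on pd.read_fwf, chunked so it doesn't eat all the RAM; the offsets and names here are placeholders for what the old SAS INPUT statement defines:

Python code:
import pandas as pd

colspecs = [(0, 17), (17, 25), (25, 40)]   # placeholder byte offsets
names = ['vin', 'date', 'model']           # placeholder column names

chunks = pd.read_fwf('extract.dat', colspecs=colspecs, names=names,
                     chunksize=1_000_000)
df = pd.concat(chunks, ignore_index=True)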

Tayter Swift
Nov 18, 2002

Pillbug
I use Pandas for my ETL all the time :shrug: Although I'm switching over to Polars now.

They might want to know if you know how to do proper method chaining and convert datetimes. And they might be sick of seeing the Titanic dataset if that was your plan.
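
By method chaining I mean the assign/query pipeline style, something like this (file and column names are made up):

Python code:
import pandas as pd

out = (
    pd.read_csv('sales.csv')
    .assign(order_date=lambda d: pd.to_datetime(d['order_date']))  # datetime conversion
    .query('order_date >= "2020-01-01"')
    .groupby('region', as_index=False)['amount'].sum()
)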

Tayter Swift
Nov 18, 2002

Pillbug

Chin Strap posted:

Better to never curate the API ever.

And at least it has a pretty long deprecation time so you can catch warnings.

Meanwhile I'm embracing chaos and have switched over to polars. Every week there's a new deprecation to chase :twisted:

Tayter Swift
Nov 18, 2002

Pillbug
Something I've kinda-sorta struggled with over the last couple of years is reconciling PEP 8 and other best practices with the rise of notebooks. Does splitting sections into functions make sense when those sections are in individual cells? Do you avoid referencing results from previous cells when you don't have to, because of what can be a non-linear way of executing code (sometimes I'll just re-read data from disk in each cell if it's quick)? Docstrings are redundant when we have Markdown, right? What about line lengths?


Tayter Swift
Nov 18, 2002

Pillbug
idk, I like the concept of notebooks having rich-text documentation alongside the code, and being able to show a process step by step.

I do wish there were a way to prevent accidentally rerunning previous sections of code (apart from an import cell), or altering a variable in a cell, changing the cell, and altering the variable again.
