Tayter Swift
Nov 18, 2002

Pillbug
Sometimes you're on a train and can't be bothered to whip out a keyboard :shrug:

I do find it rather annoying that the shift key in Juno acts as a toggle instead of applying to a single letter.


Tayter Swift
Nov 18, 2002

Pillbug
Why are we discussing performance for a task that will take a couple milliseconds a day :confused:

Tayter Swift
Nov 18, 2002

Pillbug
Oh boy, anecdotes!

I taught myself Python basics in maybe 2011 with some Project Euler exercises. Started using it for work in 2013 with some data science-y tasks because the alternatives were Matlab and SAS and neither appealed to me. My 'IDE' has always been SublimeText 2 and its interpreter, so I never bothered with things outside the standard library. I wrote up a class that in hindsight feels like a stripped-down pandas because I didn't know any better.

Last fall work got some of us a subscription to DataCamp and frankly it's been great. I'm more or less okay with learning from video and they include lots of exercises. Started using Jupyter, pandas and other extensions in the past month and I'm excited to transform a lot of my work products with it all.

Tayter Swift
Nov 18, 2002

Pillbug
Stupid newbie matplotlib question that's frustrating the gently caress out of me.

I have a time series with daily data going back several years. On my work computer I can plot it and it gives one tick per year on the x-axis like a sane plot would do, but I get a dreaded FutureWarning I don't understand:

code:
FutureWarning: Using an implicitly registered datetime 
converter for a matplotlib plotting method. The converter was 
registered by pandas on import. Future versions of pandas 
will require you to explicitly register matplotlib converters.

To register the converters:
	>>> from pandas.plotting import register_matplotlib_converters
	>>> register_matplotlib_converters()
  warnings.warn(msg, FutureWarning)
At home, however, it's trying to label every day, resulting in an unreadable mess for the x-axis (see screenshot below). How do I fix this so that there is only one tick and label per year?
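
For reference, I'd have expected to be able to force yearly ticks explicitly with matplotlib's date locators even when the defaults misbehave. A minimal sketch, assuming the index really is datetimes (df and the 'value' column are stand-ins):

Python code:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

fig, ax = plt.subplots()
ax.plot(df.index, df['value'])

# force one major tick per year, labeled with the four-digit year
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
plt.show()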


Tayter Swift
Nov 18, 2002

Pillbug
Turns out the reg field was read in as a string and not a date :doh:

I assumed that, since the column was a date when pandas saved the CSV, pandas would read it back in as a date without explicitly using parse_dates. Welp.
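
The fix is just telling read_csv up front which columns are dates; 'reg' is my column name, and data.csv is a stand-in:

Python code:
import pandas as pd

# parse the 'reg' column as datetimes at read time
df = pd.read_csv('data.csv', parse_dates=['reg'])

# or, fixing it after the fact
df['reg'] = pd.to_datetime(df['reg'])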

Tayter Swift
Nov 18, 2002

Pillbug

Deadite posted:

I don’t know if this is the right place to ask, but I am trying to read a large (750 MB) CSV file into a pandas dataframe, and it seems to be taking an unreasonably long time. I am limiting it to only 8 columns with the usecols option, but the read_csv method is still taking 6 minutes to read the file into Python.

I haven’t been using Python for very long and I’m coming from a SAS programming background. In SAS this file loads in a few seconds, so I feel like I am screwing something up for this to take so long. I originally tried the read_sas method to load the original 1.5 GB dataset, but I had a memory error and had to convert the file to CSV to get around that. The file only has 170k rows.

Oh awesome, I'm in almost exactly the same situation you're in, except my datasets can reach 40 GB of SAS. I've pared them down to maybe 11-15 GB each, but I still have to convert to CSV because I can't figure out how to make read_sas use chunks. I'll have to keep you in mind and bounce ideas off you.
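
Actually, it looks like read_sas does take a chunksize after all, which would skip the CSV detour entirely. A sketch with placeholder file and column names:

Python code:
import pandas as pd

# with chunksize set, read_sas returns an iterator of DataFrames
chunks = pd.read_sas('big_dataset.sas7bdat', chunksize=100_000)

# keep only the columns you need from each chunk, then glue them together
pieces = [chunk[['col_a', 'col_b']] for chunk in chunks]
df = pd.concat(pieces, ignore_index=True)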

Tayter Swift
Nov 18, 2002

Pillbug

CarForumPoster posted:

If you're loading them multiple times, pd.read_pickle() is really flexible and WAY WAY faster to load than a CSV. Almost as fast as feather, but I've had issues with feather before where pickles "just work".

Hm, the [url=https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_pickle.html]docs[/url] for read_pickle don't really show much flexibility. Ideally I'd like something where I could read data in either in chunks or based on a filter, but pickle looks like it reads in the whole thing.

Suppose I could do sqlalchemy, but while I know SQL, I've always found the Python implementations annoyingly verbose. HDF5 looks interesting, but I couldn't wrap my head around it the last time I looked.
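
That said, HDF5 through pandas might actually cover both the chunked and the filtered reads, as long as the file is written in table format. A rough sketch (the 'date' column and process() are placeholders):

Python code:
import pandas as pd

# write once in queryable table format
df.to_hdf('store.h5', key='data', format='table', data_columns=['date'])

# read back only the rows matching a filter...
recent = pd.read_hdf('store.h5', 'data', where='date >= "2018-01-01"')

# ...or stream the whole thing in chunks
for chunk in pd.read_hdf('store.h5', 'data', chunksize=100_000):
    process(chunk)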

Tayter Swift
Nov 18, 2002

Pillbug
Multilevel indices destroy my brain and I hates them. I like to convert stuff to tall format and operate that way, then pivot back to whatever form the output needs to be in. (I think that's called Tidy Data? Dunno)
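
The round trip I mean, roughly (column names are made up):

Python code:
import pandas as pd

# wide -> tall ("tidy"): one row per (id, variable, value)
tall = df.melt(id_vars=['id'], var_name='measure', value_name='value')

# do the work on the tall frame
summary = tall.groupby(['id', 'measure'])['value'].mean().reset_index()

# tall -> wide again for the final output
wide = summary.pivot(index='id', columns='measure', values='value')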

Tayter Swift
Nov 18, 2002

Pillbug
I don't do a whole lot of OOP these days, so please bear with me here.

I've made a thin wrapper API around DataFrame; let's call it SDataFrame. All's well and good until I want to call a method that returns a DataFrame, such as df.copy() or df.drop(). This of course returns a DataFrame, but I want it to return an SDataFrame instead. How can I do this? I tried various flavors of overriding the calls with super(), but no dice.

Tayter Swift
Nov 18, 2002

Pillbug

Protocol7 posted:

I think you can pass a DataFrame into the data kwarg of the DataFrame constructor, so you might be able to just make a new instance of your subclass:

Python code:
df = SDataFrame()

# ... do stuff to df

df_as_SDataFrame = SDataFrame(data=df.copy())

Yeah I've essentially been doing that, but I was wondering if there's a more transparent way to do it in the class definition itself. Ideally I'd like
Python code:
df = SDataFrame()
df2 = df.copy()
to just go ahead and create an SDataFrame instance, so I'm not having to keep track of which methods I need to wrap.
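
For posterity: pandas' subclassing docs point at overriding the _constructor property, which looks like exactly this. A minimal sketch, assuming SDataFrame subclasses DataFrame:

Python code:
import pandas as pd

class SDataFrame(pd.DataFrame):
    # pandas calls _constructor whenever a method builds a new frame,
    # so copy(), drop(), etc. come back as SDataFrame
    @property
    def _constructor(self):
        return SDataFrame

df = SDataFrame({'a': [1, 2, 3]})
print(type(df.copy()))  # <class '__main__.SDataFrame'>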

Tayter Swift
Nov 18, 2002

Pillbug
I have a 40 GB CSV, about 50 million records by 90-ish fields. I need to sort it by the VIN and Date fields and drop duplicate VINs, keeping the most recent Date. My machine has 64 GB of RAM.

What's a sensible way to accomplish this? After compacting categories in dask, saving it as parquet, and re-reading it into a pandas df it's about 21 GB, but it still can't be easily manipulated in pandas, and I know these aren't dask-friendly parallelizable operations.

Tayter Swift
Nov 18, 2002

Pillbug
Haha, fair enough. I've worked with SQL plenty but never worked with SQLite before, so maybe it's time to learn something new.

For now, I just powered through with them df.sort_values and df.drop_duplicates commands, because gently caress RAM. Only took about an hour total and my computer didn't melt, so better than I expected to be honest :shrug:
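
For the record, the whole dedupe boiled down to this (keep='last' relies on the ascending sort):

Python code:
# the newest Date for each VIN sorts last, so keep that row
df = df.sort_values(['VIN', 'Date'])
df = df.drop_duplicates(subset='VIN', keep='last')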

Tayter Swift
Nov 18, 2002

Pillbug
I'm starting to get some real imposter syndrome when it comes to pandas and dask. Here's some simple code:

code:
import dask.distributed

print('writing parquet')

# ship the dask dataframe to the workers, then run the write as a task
big_future = client.scatter(ddf)

def doit(frame):
    frame.to_parquet(path='raw 21q4/', schema='infer')

future = client.submit(doit, big_future)
dask.distributed.wait(future)
dask was yelling at me about the task graph when I was simply writing ddf.to_parquet(), so okay, I'm doing it as a scatter like it says. I expect this code to create a parquet file series in the specified path. Instead it churns forever and deposits nothing. The dask dashboard shows that it's clearly working during this time. Is there something in this code that makes it just skip the actual "write the goddamn parquet" step?

Maybe I just don't understand how the hell dask works.

edit: never loving mind, I was writing to the wrong goddamned directory :ughh:

Tayter Swift fucked around with this message at 21:56 on May 19, 2022

Tayter Swift
Nov 18, 2002

Pillbug
In my case I'm porting some rather involved ETL on some large fixed-width-format files (40 GB, 50 million rows by 100+ columns) over from ancient SAS code. It's been pretty slow going, as I'm feeling a bit over my head, and right now it's way the hell slower than the SAS code.
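
For the fixed-width part I'm leaning on pd.read_fwf, chunked so it doesn't eat all the RAM; the offsets and names here are placeholders for what the old SAS INPUT statement defines:

Python code:
import pandas as pd

colspecs = [(0, 17), (17, 25), (25, 40)]   # placeholder byte offsets
names = ['vin', 'date', 'model']           # placeholder column names

chunks = pd.read_fwf('extract.dat', colspecs=colspecs, names=names,
                     chunksize=1_000_000)
df = pd.concat(chunks, ignore_index=True)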

Tayter Swift
Nov 18, 2002

Pillbug
I use Pandas for my ETL all the time :shrug: Although I'm switching over to Polars now.

They might want to know if you know how to do proper method chaining and convert datetimes. And they might be sick of seeing the Titanic dataset if that was your plan.
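
By method chaining I mean the assign/query pipeline style, something like this (file and column names are made up):

Python code:
import pandas as pd

out = (
    pd.read_csv('sales.csv')
    .assign(order_date=lambda d: pd.to_datetime(d['order_date']))  # datetime conversion
    .query('order_date >= "2020-01-01"')
    .groupby('region', as_index=False)['amount'].sum()
)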

Tayter Swift
Nov 18, 2002

Pillbug

Chin Strap posted:

Better to never curate the API ever.

And at least it has a pretty long deprecation time so you can catch warnings.

Meanwhile I'm embracing chaos and have switched over to polars. Every week there's a new deprecation to chase :twisted:

Tayter Swift
Nov 18, 2002

Pillbug
Something I've kinda-sorta struggled with over the last couple of years is reconciling PEP 8 and other best practices with the rise of notebooks. Does splitting sections into functions make sense when those sections are in individual cells? Do you avoid referencing results from previous cells when you don't have to, because of what can be a non-linear way of executing code (sometimes I'll just re-read data from disk in each cell if it's quick)? Docstrings are redundant when we have Markdown, right? What about line lengths?


Tayter Swift
Nov 18, 2002

Pillbug
idk, I like the concept of notebooks having rich-text documentation alongside the code, and being able to show a process step by step.

I do wish there were a way to prevent accidentally rerunning previous sections of code (apart from an import cell), or altering a variable in a cell, changing the cell, and altering the variable again.
