Python

The Something Awful Forums > Discussion > Serious Hardware/Software Crap > The Cavern of COBOL > Python

Deadite: Aug 30, 2003; A fat guy, a watermelon, and a stack of magazines?
Family.

I don't know if this is the right place to ask, but I am trying to read a large (750 MB) CSV file into a pandas DataFrame, and it seems to be taking an unreasonably long time. I am limiting the read to only 8 columns with the usecols option, but the read_csv method is still taking 6 minutes to read the file into Python.

I haven't been using Python for very long and I'm coming from a SAS programming background. In SAS this file loads in a few seconds, so I feel like I am screwing something up for this to take so long. I originally tried the read_sas method to load the original 1.5 GB dataset, but I had a memory error and had to convert the file to CSV to get around that. The file only has 170k rows.

Does anyone have an idea why this is taking so long? Or is this just a normal amount of time for Python to process this file? Google/Stack Exchange are getting me nowhere.

Edit: Never mind, I switched the file from a network drive to my local drive and now it loads in 4 seconds. I guess it's a network I/O issue and not a Python issue.

Deadite fucked around with this message at 03:00 on Sep 28, 2019

Sep 28, 2019 02:23

Deadite: Aug 30, 2003; A fat guy, a watermelon, and a stack of magazines?
Family.

Is there a way to create PDF reports without a specific package that does that? At work I am limited to the packages that come with Anaconda, and I can't see anything that will work.

Everything I find on Google is just "install reportlab" and it's frustrating because my IT department won't let me.

Dec 1, 2019 21:44

Deadite: Aug 30, 2003; A fat guy, a watermelon, and a stack of magazines?
Family.

I'm pretty new to Python, and I didn't think to just download the code and import it that way.

I'm used to programming in SAS, so having to find packages to accomplish tasks is hard to get the hang of. I keep thinking there must be a way to do everything in vanilla Python, and that's the wrong way to think about creating programs, it seems.

Dec 2, 2019 04:27

Deadite: Aug 30, 2003; A fat guy, a watermelon, and a stack of magazines?
Family.

Does anyone have experience running Dask on AWS? I'm trying to start a Dask cluster on an EC2 instance, but the Dask scheduler never seems to start. I'm using EC2Cluster from dask-cloudprovider. Here's what I see in PuTTY when the cluster context manager executes:

code:

Creating scheduler instance
Created instance i-0be36abee1a3206c9 as dask-c958e0dc-scheduler
Waiting for scheduler to run at ip:port

and that's it. So the scheduler never starts running and no workers are created, but I'm not getting any errors or anything that I can trace back to a problem. Google is getting me nowhere with this issue either, so can anyone tell me what I'm doing wrong?

May 4, 2021 03:57

Deadite: Aug 30, 2003; A fat guy, a watermelon, and a stack of magazines?
Family.

I'm having an issue trying to add a matplotlib graph to a tkinter GUI where the legend to the graph is getting cut off if I move it to below the chart.

Does anyone know a way to display the legend when it is outside of the chart? FigureCanvasTkAgg doesn't have a height argument so I can't just stretch the viewable area. It looks like it is some kind of automatic resizing problem that I can't figure out a way around.

Here's the test code:

Python code:

import tkinter as tk
from pandas import DataFrame
import matplotlib.pyplot as plt
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg

data1 = {'Country': ['US','CA','GER','UK','FR'],
         'GDP_Per_Capita': [45000,42000,52000,49000,47000]
        }
df1 = DataFrame(data1,columns=['Country','GDP_Per_Capita'])

root= tk.Tk() 
  
figure1 = plt.Figure(figsize=(6,5), dpi=100)
ax1 = figure1.add_subplot(111)
bar1 = FigureCanvasTkAgg(figure1, root)
bar1.get_tk_widget().pack(side=tk.LEFT, fill=tk.BOTH)
df1 = df1[['Country','GDP_Per_Capita']].groupby('Country').sum()
df1.plot(kind='bar', legend=True, ax=ax1)
ax1.legend(bbox_to_anchor=(0.7, -0.12)) 
ax1.set_title('Country Vs. GDP Per Capita')

root.mainloop()

Jun 17, 2021 14:42

Deadite: Aug 30, 2003; A fat guy, a watermelon, and a stack of magazines?
Family.

OnceIWasAnOstrich posted:

I've never had this problem in the context of another GUI but I've definitely run into similar issues with bits of a matplotlib figure getting rendered outside the bounds of an image. It is usually something that a call to tight_layout() or other layout-modifying functions can address.

Thanks, tight_layout is exactly what I was looking for.
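
For anyone landing here with the same problem, a minimal sketch of the tight_layout() fix, assuming a recent Matplotlib (3.x) where legends are included in the layout calculation. It uses the headless Agg canvas in place of FigureCanvasTkAgg so it runs without a Tk window, and it writes the bar heights from the test code out as plain lists. The loc='upper center' setting and the exact anchor are illustrative choices, not something from the thread.

```python
from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure

# Same data as the test code, written out as plain lists.
countries = ['CA', 'FR', 'GER', 'UK', 'US']
gdp_per_capita = [42000, 47000, 52000, 49000, 45000]

figure1 = Figure(figsize=(6, 5), dpi=100)
FigureCanvasAgg(figure1)  # assumption: headless stand-in for FigureCanvasTkAgg
ax1 = figure1.add_subplot(111)
ax1.bar(countries, gdp_per_capita, label='GDP_Per_Capita')
ax1.set_title('Country Vs. GDP Per Capita')

# Anchor the legend below the axes, where it would otherwise be clipped.
ax1.legend(loc='upper center', bbox_to_anchor=(0.5, -0.12))

bottom_before = figure1.subplotpars.bottom
# Call tight_layout() only after the legend exists so it is taken into account.
figure1.tight_layout()
bottom_after = figure1.subplotpars.bottom

print(f"bottom margin: {bottom_before:.2f} -> {bottom_after:.2f}")
```

In the Tk version the same call would be figure1.tight_layout(), placed after ax1.legend(...) and before root.mainloop().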

Jun 17, 2021 16:49

Deadite: Aug 30, 2003; A fat guy, a watermelon, and a stack of magazines?
Family.

Gobbeldygook posted:

I am a coding newbie working through Learn Python The Hard Way. For exercise 36 he says to make a text-based adventure game. I decided to add quicktime events, which requires timed input. I found a complicated, Windows-only method on reddit and it works fine, but I decided to try to make the seemingly simple cross-platform solution work just for its own sake. Here is what I have:

code:

from threading import Timer

def dead():
    print("You are dead!")

def samurai():
    timeout = 5
    t = Timer(timeout, dead, [])
    t.start()
    print(f"You take one out and engage the other! You have {timeout} seconds per move\n")
    print("HIGH STAB\n1. DUCK\n2. JUMP\n3. PARRY\n4. STRIKE")
    answer = input()
    answer = int(answer)
    if answer == 1:
        t.cancel()
        t.start()
        print("HE\'S OPEN\n1. DUCK\n2. JUMP\n3. PARRY\n4. STRIKE")
        answer = input()
        answer = int(answer)
        if answer == 4:
            t.cancel()
            print("VICTORY!")
        else:
            t.cancel()
            dead()
    else:
        t.cancel()
        dead()
    t.cancel()

samurai()

The way I thought this would work is after the player put in a correct move I could cancel the timer and then start a new timer for the next move. After inputting "1" I get "RuntimeError: threads can only be started once".

Is there some easy way to do what I want or should I stick to my Windows-only method?

The easiest way to fix this would just be to reset t to a new Timer object, like

code:

if answer == 1:
    t.cancel()
    # start() returns None, so create the Timer first and start it on its own line
    t = Timer(timeout, dead, [])
    t.start()

I don't know if that is the best way to fix it though
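
A slightly fuller, self-contained sketch of the same idea, assuming a short 0.2-second timeout and a hypothetical on_timeout callback so that it finishes quickly; neither is from the original game. It shows that a cancelled Timer still cannot be started again, while a fresh Timer object can.

```python
from threading import Timer

events = []

def on_timeout():
    # Hypothetical stand-in for dead(): record that the timer fired.
    events.append('timed out')

timeout = 0.2  # seconds; shortened for the demo

# First move: start a timer, then cancel it as if the player answered in time.
t = Timer(timeout, on_timeout)
t.start()
t.cancel()

# Reusing the same object fails, even after cancel().
try:
    t.start()
    restart_error = None
except RuntimeError as exc:
    restart_error = exc

# Second move: build a new Timer instead. start() returns None, so keep the
# constructor and the start() call on separate lines.
t = Timer(timeout, on_timeout)
t.start()
t.join()  # no answer this time, so wait for the timer to fire

print(restart_error)
print(events)
```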

Jul 19, 2021 03:20

Deadite: Aug 30, 2003; A fat guy, a watermelon, and a stack of magazines?
Family.

I'm having a problem figuring out a regex and hopefully someone can point to my mistake. Here's the test case:

code:

import re
re.findall(r"\bX_[RR|FF]?[_]?D", 'X_FF_D X_RR_D X_D')

My desired result is a list with all of the words in the string (['X_FF_D', 'X_RR_D', 'X_D']) but I can't figure out how to structure the or for RR|FF to return what I want. It doesn't seem to like multiple characters the way I have it now.

The code works when it's just one character:

code:

import re
re.findall(r"\bX_[R|F]?[_]?D", 'X_F_D X_R_D X_D')

Dec 8, 2021 22:51

Deadite: Aug 30, 2003; A fat guy, a watermelon, and a stack of magazines?
Family.

Thanks to you both. I'm not great with groups and didn't realize I needed an outer group along with the inner group to return what I wanted.
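
For reference, a sketch of why the bracket version fails and one grouped pattern that returns the desired list. Square brackets define a character class, so [RR|FF] matches a single R, F, or | character rather than the strings RR or FF. The non-capturing (?:...) groups are one possible choice here, not necessarily the pattern suggested in the replies; with capturing groups, re.findall would return the group contents instead of the whole match.

```python
import re

text = 'X_FF_D X_RR_D X_D'

# Character class: at most one of R, |, F may appear between X_ and D.
bracket_result = re.findall(r"\bX_[RR|FF]?[_]?D", text)

# Inner group picks RR or FF; outer group makes the "RR_" / "FF_" part optional.
grouped_result = re.findall(r"\bX_(?:(?:RR|FF)_)?D\b", text)

print(bracket_result)
print(grouped_result)
```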

Dec 8, 2021 23:07

Deadite: Aug 30, 2003; A fat guy, a watermelon, and a stack of magazines?
Family.

Following along with this video helped me immensely when I was starting out with pandas

https://youtu.be/5JnMutdy6Fw

Dec 14, 2021 13:42

Deadite: Aug 30, 2003; A fat guy, a watermelon, and a stack of magazines?
Family.

Can anyone help me understand why this example ends in an error:

code:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a':[np.nan,1,2,3,4,5,6,np.nan],'b':[7,np.nan,8,9,10,np.nan,11,12]})
test = df[['a']].reset_index(drop=True)
test = test.fillna('Missing')
test['a'] = np.where((test['a'] == 'Missing'), test['a'], np.where((test['a'].astype('float') < 3) | (test['a'].astype('float') > 5), 1000, test['a']))
test

but this works as expected?

code:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a':[np.nan,1,2,3,4,5,6,np.nan],'b':[7,np.nan,8,9,10,np.nan,11,12]})
test = df[['a']].reset_index(drop=True)
test['a'] = np.where((test['a'].isna()), test['a'], np.where((test['a'].astype('float') < 3) | (test['a'].astype('float') > 5), 1000, test['a']))
test

I would think that in both versions the outer where would pass only the rows that fail the first condition on to the inner where, so the first version shouldn't be converting any strings to float if the second one isn't. It doesn't seem to work that way.

Nov 22, 2022 23:29

Deadite: Aug 30, 2003; A fat guy, a watermelon, and a stack of magazines?
Family.

I can tell that from the error. What I don't understand is why that string would have the method applied in the first place, since it should have been filtered out in the first part of the where function.

Nov 24, 2022 02:39

Deadite: Aug 30, 2003; A fat guy, a watermelon, and a stack of magazines?
Family.

QuarkJets posted:

That's not the order of operations; the way you have this set up, astype occurs first, and that new dataset would become one of the inputs of the "where" function

So astype is the first part executed before the first "where" happens? I thought the first "where" statement would have filtered out the 'Missing' strings before the second where executes and applies the astypes

Nov 24, 2022 03:04

Deadite: Aug 30, 2003; A fat guy, a watermelon, and a stack of magazines?
Family.

Oooooooh okay, I see now. Thank you for your very clear explanation. I didn't think about how all of the arguments in the "where" function needed to be resolved before the function executes, but that is how literally all functions work. I just couldn't see it in this case for whatever reason.

For context I was trying to write a "where" statement to replace values in a column that already contained the "Missing" strings, so I needed to find a way to filter those out, then apply the criteria to the intended population, and return all the values back to the dataframe with every value in its original index position.
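
The evaluation-order point can be reproduced without pandas. In this sketch, pick is a hypothetical list-based stand-in for np.where (it is not a NumPy function), and to_float is an illustrative helper. Python builds every argument in full before the call happens, so the float conversion runs over 'Missing' no matter what the condition says. Making the conversion itself tolerant of bad values avoids the error.

```python
import math

def pick(cond, if_true, if_false):
    # Hypothetical stand-in for np.where on plain lists.
    return [t if c else f for c, t, f in zip(cond, if_true, if_false)]

def to_float(value):
    # Illustrative helper: non-numeric values become NaN instead of raising.
    try:
        return float(value)
    except ValueError:
        return math.nan

values = ['Missing', 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 'Missing']
is_missing = [v == 'Missing' for v in values]

# Eager version: the list comprehension is evaluated before pick() is called,
# so float('Missing') raises even though pick() would never use that element.
try:
    pick(is_missing, values,
         [1000 if float(v) < 3 or float(v) > 5 else v for v in values])
    eager_error = None
except ValueError as exc:
    eager_error = exc

# Tolerant version: NaN compares False with < and >, so 'Missing' is left alone.
numeric = [to_float(v) for v in values]
replaced = pick(is_missing, values,
                [1000 if n < 3 or n > 5 else v for n, v in zip(numeric, values)])

print(replaced)
```

In pandas itself, pd.to_numeric(test['a'], errors='coerce') plays the role of to_float, turning non-numeric entries into NaN before the comparisons run.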

Nov 24, 2022 05:53

Deadite: Aug 30, 2003; A fat guy, a watermelon, and a stack of magazines?
Family.

Does anyone have a good resource for dask that can be understood by an idiot? I've been struggling with the library for way too long and I still have no idea what I'm doing.

It's really frustrating to think you're running code in parallel, only to find out that you're not actually using all the threads in your processor unless you set the config to either 'multiprocessing' or 'distributed' and I don't know the difference between them. All I know is that 'distributed' runs faster than 'multiprocessing' but also causes my computer to restart with larger files. It also produces cryptic messages like this:

code:

distributed.nanny - WARNING - Worker process still alive after 4 seconds, killing

Why'd you have to kill my worker? He was only four seconds old!

Anyway I feel like I need a better foundation and reading the documentation is getting me nowhere.

Apr 22, 2023 20:31

Deadite: Aug 30, 2003; A fat guy, a watermelon, and a stack of magazines?
Family.

So here's my dask test case, and it's a little misleading because when I check the times the compute() without a LocalCluster/Client runs much, much faster for the example than it does with the program I'm actually building. The real program runs faster without a LocalCluster until the file gets to be around 1GB in size, then the LocalCluster distributed compute starts being faster. I can't seem to recreate some of the errors I'm seeing with large files with the example code though. It does top out at 5 million rows before I get this error, which I don't get with my real program. This whole thing is so confusing.

code:

ValueError: 3713192179 exceeds max_bin_len(2147483647)

Python code:

import pandas as pd
import numpy as np
from datetime import datetime
from dask import delayed, compute
from dask.distributed import Client, LocalCluster
import multiprocessing
import platform

if platform.system().lower() == 'windows':
    import multiprocessing.popen_spawn_win32
else:
    import multiprocessing.popen_spawn_posix
    
cols = 100

for rows in range(1000000,11000000,1000000):
    # Create dataframe (single column with integers that match row number)
    df = pd.DataFrame(np.arange(rows).reshape(rows,1))
    df.columns = ['col1']

    # Duplicate that column to number set in 'cols'
    for i in range(cols):
        df['col'+str(i)] = df['col1']
        
        
    for splits in range(1,5):
        
        # Split dataframe by columns
        k, m = divmod(len(df.columns), splits)
        list_of_split_dfs = list((df[df.columns[i*k+min(i, m):(i+1)*k+min(i+1, m)]] for i in range(splits)))

        # Function that accepts dataframe and returns a dictionary 
        # with column name as key, and a dictionary with min/max as a value
        def find_minmax(df):

            results = {}
            for col in df.columns:
                results[col] = {}
                results[col]['min'] = df[col].min()
                results[col]['max'] = df[col].max()

            return results

        # Function to combine the list of dictionaries into one
        def combine_dicts(list_of_dicts):
            results = {}
            for d in list_of_dicts:
                results.update(d)
            return results

        # Create a list of delayed find_minmax functions
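        list_of_dicts = []
        # Use a separate loop name so the full df is not overwritten before the next 'splits' pass
        for split_df in list_of_split_dfs:
            list_of_dicts.append(delayed(find_minmax)(split_df))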

        # Submit the delayed find_minmax functions to combine_dicts function
        combined_dictionary = delayed(combine_dicts)(list_of_dicts)
        start = datetime.now()
        combined_dictionary.compute()
        time_base = (datetime.now()-start).total_seconds()

        # Submit the function to a multiprocessing client
        start = datetime.now()
        with LocalCluster(n_workers=splits, dashboard_address=None) as cluster, Client(cluster) as client:
            combined_dictionary.compute(scheduler='multiprocessing')
        time_mp = (datetime.now()-start).total_seconds()

        # Submit the function to a distributed client
        start = datetime.now()
        with LocalCluster(n_workers=splits, dashboard_address=None) as cluster, Client(cluster) as client:
            combined_dictionary.compute(scheduler='distributed')
        time_dist = (datetime.now()-start).total_seconds()

        line = ', '.join([str(rows),str(cols),str(splits),str(time_base),str(time_mp),str(time_dist)])
        print(line)

Apr 23, 2023 19:56

Deadite: Aug 30, 2003; A fat guy, a watermelon, and a stack of magazines?
Family.

I have a quick question that I cannot figure out and the keywords involved make googling difficult.

I'm also having a hard time explaining this so bear with me.

I am trying to write an if statement that checks that one variable isn't in a list of values, or if that variable isn't equal to a specific value while another variable is equal to a specific value at the same time. My test case is below. In this case I only want to see 'Right' when x is either 'a', 'b', or 'c' OR x is 'd' while y is also 3.

code:

x = 'd'
y = 3

if not (x == 'd' and y == 3) or (x not in ['a', 'b', 'c']):
    print('Wrong')
else:
    print('Right')

I can get the 'Right' result if I only use one of these tests at a time, but when I link them with an 'or' it stops working. Am I missing something obvious here?

Feb 25, 2024 16:37

Deadite: Aug 30, 2003; A fat guy, a watermelon, and a stack of magazines?
Family.

FISHMANPET posted:

Is the not negating the entire statement or just the part before the or? I think you need more parentheses, because the not isn't applying precisely how you want it.

I realize this is a simplified example to demonstrate the issue, but in production do you want "right" to be an outcome? Because if so, I don't see why you're even bothering with the not to start with the negative case.

Yes, the 'not' should only apply to the first conditions wrapped in parentheses and not the one after the 'or'. I tried wrapping the whole thing in parentheses like (not (x == 'd' and y == 3)) but that doesn't seem to work either.

'Right' is the desired result when x = 'd' and y = 3. The test case is just an example from a much larger program that I can't easily restructure so I just need to figure out if what I'm trying to do is possible. It feels like there should be a way to do this but I can't find any info.

Feb 25, 2024 17:17

Deadite: Aug 30, 2003; A fat guy, a watermelon, and a stack of magazines?
Family.

boofhead posted:

yeah, why are you writing it like that? why not just write

Python code:

x = 'd'
y = 3

if (x == 'd' and y == 3) or (x in ['a', 'b', 'c']):
    print('right')
else:
    print('wrong')

the reason your example doesn't work the way you want it to is because combining conditions with an 'or' means that it checks if any of the conditions is True and goes down that path

and the second condition evaluates to True because x is NOT in ['a', 'b', 'c']

so what it's doing in your example is:

Python code:

# for x == 'd' and y == 3
# not (x == 'd' and y == 3) or (x not in ['a', 'b', 'c'])
# => not (True) or (True)
# => False or True
# => True

so it goes down that path every time

e: using negatives like that is absolute hell though so if i caught anybody writing code like that i would judge them forever and probably try to get them fired. out of a cannon, into a volcano

I have to do it this way because these two new conditions are just an addition to a very long list of existing conditions, and at least some of them will need to be negated. I inherited this program and I can't change how it's structured without causing a lot of drama so I'm trying to do the best with what I have. I agree that it's going to be a (more) confusing mess from here on out.

Feb 25, 2024 17:35

Deadite: Aug 30, 2003; A fat guy, a watermelon, and a stack of magazines?
Family.

boofhead posted:

refactor the whole thing imo

but if you're determined to go ahead with it, this should do what you're asking

Python code:

if not ((x == 'd' and y == 3) or (x in ['a', 'b', 'c'])):
    print('wrong')
else:
    print('right')

hope that code has amazing unit tests btw

Perfect, thank you. And I'm sure that team doesn't do any unit testing. I got pulled in to help out because they were running behind. At the end of the month this program will not be my problem again until an ironic reorg forces me to maintain it.

Here's a more accurate representation of the issue, but pretend there are about 20 more variables that need to be tested:

Python code:

v = 'N'
w = 'N'
x = 'a'
y = 3

if (v == 'N' or w == 'N') and (not ((x == 'd' and y == 3) or (x in ['a', 'b', 'c']))):
    print('do something')
else:
    print('do something else')
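
Since more of these conditions will need negating, a small truth-table check can confirm that a rewritten condition still means the same thing. This sketch compares the original test case, the corrected form from the reply above, and its De Morgan expansion over a handful of sample values. The function names and sample values are made up for illustration.

```python
from itertools import product

def original(x, y):
    # 'not' binds tighter than 'or', so only the first parenthesis is negated.
    return not (x == 'd' and y == 3) or (x not in ['a', 'b', 'c'])

def corrected(x, y):
    # Negate the whole "acceptable" condition at once.
    return not ((x == 'd' and y == 3) or (x in ['a', 'b', 'c']))

def expanded(x, y):
    # De Morgan: not (A or B) is the same as (not A) and (not B).
    return (not (x == 'd' and y == 3)) and (x not in ['a', 'b', 'c'])

samples = list(product(['a', 'b', 'c', 'd', 'e'], [3, 4]))
mismatches = [(x, y) for x, y in samples if corrected(x, y) != expanded(x, y)]

print(mismatches)
print(original('d', 3), corrected('d', 3))
```

An empty mismatches list means the two negated forms agree on every sample, so either can be dropped into the longer chain of conditions.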

Feb 25, 2024 17:51
