vikingstrike
Sep 23, 2007

whats happening, captain
Sitting down over a weekend and reading the docs and tweaking my matplotlibrc file to my liking was one of the best things I ever did. While I went through the docs, I created dummy template files that have the code for the most common plots I do, so now I just copy and paste it over and I'm off to the races. I tried out seaborn and some of the other graphics libraries, but just always came back to matplotlib.
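
For reference, one of those templates is nothing fancy; roughly something like this, with placeholder labels and dummy data standing in for the real series:

Python code:
import numpy as np
import matplotlib.pyplot as plt

# Dummy data; in practice the real series gets pasted in here.
x = np.linspace(0, 10, 100)
y = np.sin(x)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, y, label="series")
ax.set_xlabel("x label")
ax.set_ylabel("y label")
ax.legend(loc="best")
fig.tight_layout()
fig.savefig("figure.png", dpi=150)  # placeholder filename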

Cingulate posted:

I think that's being a bit too down on matplotlib - you can still do reasonably pythonic stuff that would be rather different in Matlab. Like, iterating over axes to set their properties etc.

Yeah, this is really nice as well. Especially when you are handling multiple subplots in one figure.
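
For example, the pattern is roughly this (dummy data, hypothetical labels):

Python code:
import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(100, 4)  # dummy data, one column per subplot

fig, axes = plt.subplots(2, 2, sharex=True)
for ax, series in zip(axes.flat, data.T):
    ax.plot(series)
    ax.set_ylabel("value")   # same property tweaks applied to every axis
    ax.grid(True)
fig.tight_layout()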


vikingstrike
Sep 23, 2007

whats happening, captain

Opulent Ceremony posted:

What's 3 doing differently?

Treating range() like xrange() from python 2, I think.
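
In other words, a quick sketch:

Python code:
# Python 2: range(n) builds a full list in memory, xrange(n) is lazy.
# Python 3: range(n) is already the lazy object; wrap it in list() if you need the list.
r = range(5)
print(r)         # range(0, 5), not [0, 1, 2, 3, 4]
print(list(r))   # [0, 1, 2, 3, 4]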

vikingstrike
Sep 23, 2007

whats happening, captain

Amberskin posted:

Hello,

I'm taking a MOOC about machine learning which has Python-based lab exercises. My knowledge of Python is quite basic, but I'm not having any big trouble with the language. Actually, I like what I've seen...

The labs are done using IPython notebooks, against a configured virtual machine, so it is a "black box" for me: I just open the browser, point to the VM and start coding. My question is whether there is some way to debug the code in IPython without resorting to the old and ugly practice of spamming the code with "print" statements. I've done some googling, but found nothing (I admit I haven't gone beyond the basic searches). Could anyone point me to any doc/tutorial/example to teach myself how to do it?

I'd recommend starting by reading about the Python debugger: https://docs.python.org/2/library/pdb.html It works just fine in IPython notebooks.
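
The usual workflow is to drop a set_trace() wherever you want to stop (or run the %debug magic right after an exception for a post-mortem). A minimal sketch with made-up names:

Python code:
import pdb

def running_total(values):   # made-up function, just to show the workflow
    total = 0
    for v in values:
        pdb.set_trace()      # execution pauses here; inspect variables, step with n/s, continue with c
        total += v
    return total

running_total([1, 2, 3])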

vikingstrike
Sep 23, 2007

whats happening, captain
If you don't have many labels, just pass them in explicitly and be done with it.

vikingstrike
Sep 23, 2007

whats happening, captain

Gothmog1065 posted:

Okay, I need some help, namely with the error that's being thrown. Basically link_list is a list of tuple pairs ( ex: list = [(1,2),(2,3)...] ), but I'm having issues removing them at the end. Here's my lovely code. I'm working through this myself, so I'm hoping to get that specific part working before I move on (then I'll get advice on how to be less stupid with my code).

Python code:
# n: the total number of nodes in the level, including the gateways
# l: the number of links
# e: the number of exit gateways
n, l, e = [int(i) for i in input().split()]
link_list = []
for i in range(l):
	# n1: N1 and N2 defines a link between these nodes
	n1, n2 = [int(j) for j in input().split()]
	link_list.append((n1,n2))
	
gateways = []
for i in range(e):
	ei = int(input())  # the index of a gateway node
	gateways.append(ei)

# game loop
while 1:
	si = int(input())  # The index of the node on which the Skynet agent is positioned this turn
	enemy = si #Let's use less stupid variable names
	for links in link_list:
		print(links[0],links[1],enemy,file=sys.stderr)
		if (links[0] == enemy) and (links[1] in gateways):
			disconnect = (enemy, links[1])
			#print("One", disconnect, file=sys.stderr)
			break
		elif (links[1] == enemy) and (links[0] in gateways):
			disconnect = (enemy, links[0])
			#print("Two", disconnect, file = sys.stderr)
			break
		else:
			disconnect = link_list[-1:]
			#print("Three",disconnect, file=sys.stderr)
			break
			
	link_list.remove(disconnect) #< -------- This is the line
	print(str(disconnect[0]) + " " + str(disconnect[1]))
The error being thrown is:

code:
ValueError: list.remove(x): x not in list
at Answer.py in <module> on line 43
I went ahead and bolded the line that is having the issue.

It's the else branch: link_list[-1:] is a slice, so "disconnect" ends up as a one-element list of tuples instead of a tuple, and list.remove() can't find a list inside link_list, hence the ValueError. (That else also breaks out of the loop on the first link that doesn't match, so the rest never get checked.) Use link_list[-1] there, or better, only call remove() once you've actually found a gateway link.
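
One way to restructure the inside of the game loop so remove() always gets a tuple that's actually in the list (a sketch reusing your variable names):

Python code:
disconnect = None
for n1, n2 in link_list:
    if n1 == enemy and n2 in gateways:
        disconnect = (n1, n2)
        break
    elif n2 == enemy and n1 in gateways:
        disconnect = (n1, n2)
        break

if disconnect is None:
    disconnect = link_list[-1]   # a tuple, unlike link_list[-1:] which is a list

link_list.remove(disconnect)
print(str(disconnect[0]) + " " + str(disconnect[1]))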

vikingstrike
Sep 23, 2007

whats happening, captain

Thermopyle posted:

Because the vast majority of things people do with Python get no particular benefit out of numpy?

This. When I first started using python I had no needs outside of the standard library, but once I started doing my data analysis/cleaning/plotting in python I moved over to the Anaconda distribution, since pandas, numpy, and MPL are daily drivers for me now. One of the things I loved as a beginner to the language was how robust (at least to me) the standard library was in helping me do what I needed.

vikingstrike
Sep 23, 2007

whats happening, captain
I usually use a combination of requests and Beautiful Soup to download and scrape information from websites. Not sure if there is anything newer that has supplanted this. For working with tabular data I use pandas.
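
A minimal sketch of that combo (the URL and the tags pulled are just placeholders):

Python code:
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com")   # placeholder URL
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
for a in soup.find_all("a"):                  # e.g. grab every link on the page
    print(a.get("href"), a.get_text(strip=True))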

vikingstrike
Sep 23, 2007

whats happening, captain

Hughmoris posted:

I'm trying to explore and understand pandas and numpy. I have a data set that looks like this:
code:

UNIT        DISCHARGE DATE             DISCHARGE TO      DISCHARGE DELAY(hh:mm:ss)
CARD    10/01/2015 15:10:00    10/01/2015 06:51:42    8:18:18
NEUR    10/01/2015 10:15:00    10/01/2015 07:13:58    3:01:02
SURG    10/01/2015 09:30:00    10/01/2015 07:15:38    2:14:22
CARD    10/01/2015 11:23:00    10/01/2015 07:17:27    4:05:33
CARD    10/01/2015 15:20:00    10/01/2015 07:22:01    7:57:59
NEUR    10/01/2015 14:26:00    10/01/2015 07:23:12    7:02:48
...
...

Is there a simple way to get an average of "DISCHARGE DELAY" for each unit? I'd like to get the average "DISCHARGE DELAY" for each unit, then do a simple graph to display those averages.

Phone posting, but get discharge delay into numeric units and then use groupby:

df.groupby("unit", sort=True)["discharge_delay"].mean()

This will find the mean discharge delay for each unit in the data. You can then take the series that's returned and easily plot a bar graph with matplotlib or the built-in pandas plotting functions.
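
Fleshed out a bit, a sketch using a few rows from your sample (column names shortened here):

Python code:
import pandas as pd

# A few rows from the sample above; in practice this comes from however you load the file.
df = pd.DataFrame({
    "UNIT": ["CARD", "NEUR", "SURG", "CARD"],
    "DISCHARGE DELAY": ["8:18:18", "3:01:02", "2:14:22", "4:05:33"],
})

# hh:mm:ss string -> timedelta -> hours, so the mean is a plain number
df["delay_hours"] = pd.to_timedelta(df["DISCHARGE DELAY"]).dt.total_seconds() / 3600

means = df.groupby("UNIT")["delay_hours"].mean()
print(means)
means.plot(kind="bar")   # pandas' built-in plotting; or hand the series to matplotlib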

vikingstrike
Sep 23, 2007

whats happening, captain
Fwiw, Anaconda on Windows has been pretty hassle-free for me and comes with a bunch of stuff pre-installed (well, at least everything I usually use regularly).

vikingstrike
Sep 23, 2007

whats happening, captain

politicorific posted:

Jesus christ...

So I tried New York, but if i put in Taipei... no dice.

url = 'http://aqicn.org/taiwan/songshan'

Also, Anaconda/Spyder is really finicky about lxml... half the time it cannot find "import lxml.html"

I've never had any issues like this with Anaconda before. Do you have multiple python installations on the same machine?

vikingstrike
Sep 23, 2007

whats happening, captain
Is morphology unique to each word (this may be stupid if you know the problem)? If so, couldn't you just merge the two frames on morphology and then keep the intersection?

pandas.merge(df1, df2, on="morphology", how="inner")

vikingstrike
Sep 23, 2007

whats happening, captain
Yeah, string.whitespace is a longer string than just "\t"
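
For example:

Python code:
import string

print(repr(string.whitespace))     # ' \t\n\r\x0b\x0c' on CPython
print("\t" in string.whitespace)   # True, but it also covers spaces, newlines, etc.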

vikingstrike
Sep 23, 2007

whats happening, captain
You can use a list comprehension to do that cleaner. Or you could use a dict of lists.

vikingstrike
Sep 23, 2007

whats happening, captain

drainpipe posted:

I'm coding up the k-means clustering algorithm for a homework assignment. I'm given a list of data points (each with two coordinates) and I need to put them into k clusters. The way I'm doing it is to define k empty lists and fill them up with indices of my data points according to which cluster they belong to. I just realized that adding an extra column to the data array to indicate cluster assignment might work just as well.

edit: actually, that probably wouldn't be good since I need to calculate the means of a cluster and accessing the points of a cluster would be much harder if I had to read it from a field.

Have you looked into pandas? Stuff like this seems like it would be suited well. For example, to find the mean of a cluster is a simple groupby function call. Accessing the data points of a cluster is just indexing, etc.
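
A rough sketch of what I mean (random points and cluster labels standing in for your real assignment step):

Python code:
import numpy as np
import pandas as pd

# Random 2-D points with a cluster label column instead of k separate lists.
points = pd.DataFrame(np.random.rand(20, 2), columns=["x", "y"])
points["cluster"] = np.random.randint(0, 3, size=len(points))

centroids = points.groupby("cluster")[["x", "y"]].mean()   # mean of each cluster in one call
members = points[points["cluster"] == 0]                   # all points currently in cluster 0
print(centroids)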

vikingstrike
Sep 23, 2007

whats happening, captain

drainpipe posted:

Oh cool, I haven't looked at that. Are you talking about the DataFrames structure? I'm not familiar with Python and its packages so I'm using this class also as an opportunity to learn Python.

Yep.

vikingstrike
Sep 23, 2007

whats happening, captain
Go ahead and queue up the math/stats notation in code argument.

vikingstrike
Sep 23, 2007

whats happening, captain
I write terrible scientific/data analysis code, but I'd like to think this thread has helped me clean it up a lot over the last couple of years.

vikingstrike
Sep 23, 2007

whats happening, captain
Not sure what you're really looking for as the output goes, but is this similar to what you have in mind?

code:
# Split the ID on "_"; the second piece is NaN when there is no underscore.
df[['id_part1', 'id_part2']] = df.Unique_ID.str.split('_', expand=True)
# Count how many rows share each first piece.
df['num_id_part1'] = df.groupby('id_part1').id_part1.transform(len)
# First piece repeats with no second piece -> Parent; repeats with a second piece -> Child.
df.loc[(df.num_id_part1 > 1) & (df.id_part2.isnull()), 'ParentChild'] = 'Parent'
df.loc[(df.num_id_part1 > 1) & (df.id_part2.notnull()), 'ParentChild'] = 'Child'
# Drop the helper columns.
df.drop(['id_part1', 'id_part2', 'num_id_part1'], axis=1, inplace=True)
code:
In [31]: df = pd.DataFrame(['423424234', '423424234_234234234234234', '97734950430'], columns=['Unique_ID'])

In [32]: df
Out[32]: 
                   Unique_ID
0                  423424234
1  423424234_234234234234234
2                97734950430

In [33]: %paste
df[['id_part1', 'id_part2']] = df.Unique_ID.str.split('_', expand=True)
df['num_id_part1'] = df.groupby('id_part1').id_part1.transform(len)
df.loc[(df.num_id_part1 > 1)&(df.id_part2.isnull()), 'ParentChild'] = 'Parent'
df.loc[(df.num_id_part1 > 1)&(df.id_part2.notnull()), 'ParentChild'] = 'Child'
df.drop(['id_part1', 'id_part2', 'num_id_part1'], axis=1, inplace=True)

## -- End pasted text --

In [34]: df
Out[34]: 
                   Unique_ID ParentChild
0                  423424234      Parent
1  423424234_234234234234234       Child
2                97734950430         NaN


vikingstrike
Sep 23, 2007

whats happening, captain
I work with much larger data frames in pandas, and for most things it's quick enough with the built-in functions. Every now and then it will blow up and I'll get a bit more involved with coding what I need, but the developers have really done a nice job over the last year or so of making it faster and using Cython where possible.

vikingstrike
Sep 23, 2007

whats happening, captain
I'll have to look closer when I'm not on a phone, but you can definitely use & and | for row indexing in pandas. From a quick glance I'm pretty sure your .loc[] calls are slightly wrong.

Here's what you want
code:
# Start Current as a copy of New.
df.loc[:, 'Current'] = df.loc[:, 'New']
# Where New was missing but Old has a value, fall back to Old.
df.loc[(df.Current.isnull()) & (df.Old.notnull()), 'Current'] = df.loc[(df.Current.isnull()) & (df.Old.notnull()), 'Old']


vikingstrike
Sep 23, 2007

whats happening, captain

Eela6 posted:

You want to use numpy / MATLAB style logical indexing.

Remember not to use the bitwise operators like '&' unless you are actually working bitwise. Apparently this is a difference between numpy and pandas.

numpy has a number of formal logic operators that are what you want, called logical_and, logical_not, logical_xor, etc...

It's easiest to understand given an example. You might already know this, but it's always nice to have a refresher.

IN:
Python code:
A = np.array([2, 5, 8, 12, 20])
print(A)
between_twenty_and_three = np.logical_and(A>3, A<20)

print(between_twenty_and_three)
A[between_twenty_and_three] = 500

print(A)
OUT:
Python code:
[ 2  5  8 12 20]
[False  True  True  True False]
[  2 500 500 500  20]
Specifically, for your question:
IN:
Python code:
def update_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    df.Current = df.New
    df.Current[pd.isnull(df.New)] = df.Old[pd.isnull(df.New)]
    return df
    
def test_update_dataframe():
    df = pd.DataFrame([[1, 2,np.nan],
                   [3, 2,np.nan],
                   [7, np.nan,np.nan], 
                   [np.nan, 8,np.nan]],
                  columns=['Old', 'New', 'Current'])
    print('old')
    print(df)
    df = update_dataframe(df)
    print('new')
    print(df)    
    
test_update_dataframe()
OUT:
code:
old
   Old  New  Current
0  1.0  2.0      NaN
1  3.0  2.0      NaN
2  7.0  NaN      NaN
3  NaN  8.0      NaN
new
   Old  New  Current
0  1.0  2.0      2.0
1  3.0  2.0      2.0
2  7.0  NaN      7.0
3  NaN  8.0      8.0

BTW, touching on your post before the edit, I believe that this is something the pandas devs do on purpose to make it better align with other data analysis platforms.

vikingstrike
Sep 23, 2007

whats happening, captain
Yeah, just use pandas. CSV is usually more convenient though, IME, especially with larger files since you'll likely never want to open them ... in Excel. But we work in real office places (I'm assuming some of us at least) and you're sent what you're sent. I usually will take a raw file like that, clean/sanity check it, and then write an HDF file with the type info.
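
Roughly like this (filenames and the dtype/parse_dates arguments are placeholders; to_hdf needs PyTables installed):

Python code:
import pandas as pd

# Placeholder filenames; dtype/parse_dates is where the type info gets pinned down.
df = pd.read_csv("raw_export.csv", parse_dates=["date"], dtype={"id": str})
# ... clean / sanity check here ...
df.to_hdf("clean.h5", key="data", mode="w")   # needs PyTables installed

# Later reads come back with the dtypes intact:
df = pd.read_hdf("clean.h5", "data")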

vikingstrike
Sep 23, 2007

whats happening, captain
For scipy, I remember that building from source can be really picky about the compiler version and flags being used. Not sure if this helps you any. :/

vikingstrike
Sep 23, 2007

whats happening, captain
I'm trying to make the transition to python 3.6, and am wondering if there are any helpful resources online that summarize the big changes in python 3. I found the What's New section of the official documentation, but am looking for more of a Cliffs Notes version until I have time to finish reading through the docs.

vikingstrike
Sep 23, 2007

whats happening, captain



Thanks!

vikingstrike
Sep 23, 2007

whats happening, captain
If you are going to be doing data analysis in python, you'll want to learn and be comfortable with pandas and numpy. Given you are coming from R I would guess pandas is exactly what you are looking for.

import pandas as pd
frame = pd.read_csv("my.csv")  # path to whatever CSV you're loading

And go to town.

vikingstrike
Sep 23, 2007

whats happening, captain

Fusion Restaurant posted:

1. Do people use spaces or tabs to indent? I've noticed when copy pasting that sublime text is giving me tabs by default while Spyder is giving me spaces.

2. Also, a pandas Q which was a little too abstract for me to easily answer on stackexchange:
I have a bunch of csvs which I've downloaded from a site. Each is data for one week, all have the same columns, and I'd eventually like to concatenate them all into a big pandas dataframe.

Is the best way to do this to first clean each csv (which will involve dropping some columns, so making it smaller), and then merge?

Or should I just combine all the csvs into one giant csv, then turn it into a dataframe, and then clean that dataframe?

My concern is that the second method might leave me with a dataframe which is too big to really work with in memory (ie RAM) while I'm cleaning it and paring down the # of columns. Would the first approach actually be more memory efficient?

The last thing I guess I could do is to do the data cleaning on the csv's directly by editing the strings, or by reading them into a base python dictionary/list -- but I wasn't sure if that would actually save any memory? It would definitely be much more annoying.

e: Actually maybe what I really should get is a recommendation of a good guide to memory management in pandas. I'm pretty familiar with it in R, and how to efficiently do things w/in that language, but am totally lost in Python/pandas.

Why would it be that much different in python? Working on smaller chunks of the larger data set and then putting them together will of course lower the amount of RAM you need at any one time, so clean each csv first and then concatenate.
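
A sketch of the clean-then-concatenate approach (file pattern and column names are made up):

Python code:
import glob
import pandas as pd

keep = ["week", "store", "sales"]   # made-up column names

frames = []
for path in glob.glob("weekly_*.csv"):               # made-up file pattern
    frames.append(pd.read_csv(path, usecols=keep))   # only ever hold the pared-down columns

combined = pd.concat(frames, ignore_index=True)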

vikingstrike
Sep 23, 2007

whats happening, captain
Merge in pandas has a "copy" parameter that controls that exact behavior. You'll want to set it to False.
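
i.e. something like this (df1, df2, and "key" being whatever you're merging):

Python code:
merged = pd.merge(df1, df2, on="key", copy=False)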


vikingstrike
Sep 23, 2007

whats happening, captain
Just use the short blurbs y'all sent me when I asked a couple of weeks ago. Super short, no politics, etc.
