vikingstrike
Sep 23, 2007

whats happening, captain
Sitting down over a weekend and reading the docs and tweaking my matplotlibrc file to my liking was one of the best things I ever did. While I went through the docs, I created dummy template files that have the code for the most common plots I do, so now I just copy and paste it over and I'm off to the races. I tried out seaborn and some of the other graphics libraries, but just always came back to matplotlib.
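
For reference, one of those templates is nothing fancy; roughly something like this, with placeholder labels and dummy data standing in for the real series:

Python code:
import numpy as np
import matplotlib.pyplot as plt

# Dummy data; in practice the real series gets pasted in here.
x = np.linspace(0, 10, 100)
y = np.sin(x)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, y, label="series")
ax.set_xlabel("x label")
ax.set_ylabel("y label")
ax.legend(loc="best")
fig.tight_layout()
fig.savefig("figure.png", dpi=150)  # placeholder filename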

Cingulate posted:

I think that's being a bit too down on matplotlib - you can still do reasonably pythonic stuff that would be rather different in Matlab. Like, iterating over axes to set their properties etc.

Yeah, this is really nice as well. Especially when you are handling multiple subplots in one figure.
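
For example, the pattern is roughly this (dummy data, hypothetical labels):

Python code:
import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(100, 4)  # dummy data, one column per subplot

fig, axes = plt.subplots(2, 2, sharex=True)
for ax, series in zip(axes.flat, data.T):
    ax.plot(series)
    ax.set_ylabel("value")   # same property tweaks applied to every axis
    ax.grid(True)
fig.tight_layout()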


vikingstrike
Sep 23, 2007

whats happening, captain

Opulent Ceremony posted:

What's 3 doing differently?

Treating range() like xrange() from python 2, I think.
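
In other words, a quick sketch:

Python code:
# Python 2: range(n) builds a full list in memory, xrange(n) is lazy.
# Python 3: range(n) is already the lazy object; wrap it in list() if you need the list.
r = range(5)
print(r)         # range(0, 5), not [0, 1, 2, 3, 4]
print(list(r))   # [0, 1, 2, 3, 4]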

vikingstrike
Sep 23, 2007

whats happening, captain

Amberskin posted:

Hello,

I'm taking a MOOC about machine learning which has Python-based lab exercises. My knowledge of Python is quite basic, but I'm not having any big trouble with the language. Actually, I like what I've seen...

The labs are done using IPython notebooks, against a configured virtual machine, so it is a "black box" for me: I just open the browser, point to the VM and start coding. My question is whether there is some way to debug the code in IPython without resorting to the old and ugly practice of spamming the code with "print" statements. I've done some googling, but found nothing (I admit I haven't gone beyond the basic searches). Could anyone point me to any doc/tutorial/example to teach myself how to do it?

I'd recommend starting by reading about the Python debugger: https://docs.python.org/2/library/pdb.html It works just fine in IPython notebooks.
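
The usual workflow is to drop a set_trace() wherever you want to stop (or run the %debug magic right after an exception for a post-mortem). A minimal sketch with made-up names:

Python code:
import pdb

def running_total(values):   # made-up function, just to show the workflow
    total = 0
    for v in values:
        pdb.set_trace()      # execution pauses here; inspect variables, step with n/s, continue with c
        total += v
    return total

running_total([1, 2, 3])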

vikingstrike
Sep 23, 2007

whats happening, captain
If you don't have many labels, just pass them in explicitly and be done with it.

vikingstrike
Sep 23, 2007

whats happening, captain

Gothmog1065 posted:

Okay, I need some help, namely with the error that's being thrown. Basically link_list is a list of tuple pairs ( ex: list = [(1,2),(2,3)...] ), but I'm having issues removing them at the end. Here's my lovely code. I'm working through this myself, so I'm hoping to get that specific part working before I move on (then I'll get advice on how to be less stupid with my code).

Python code:
# n: the total number of nodes in the level, including the gateways
# l: the number of links
# e: the number of exit gateways
n, l, e = [int(i) for i in input().split()]
link_list = []
for i in range(l):
	# n1: N1 and N2 defines a link between these nodes
	n1, n2 = [int(j) for j in input().split()]
	link_list.append((n1,n2))
	
gateways = []
for i in range(e):
	ei = int(input())  # the index of a gateway node
	gateways.append(ei)

# game loop
while 1:
	si = int(input())  # The index of the node on which the Skynet agent is positioned this turn
	enemy = si #Let's use less stupid variable names
	for links in link_list:
		print(links[0],links[1],enemy,file=sys.stderr)
		if (links[0] == enemy) and (links[1] in gateways):
			disconnect = (enemy, links[1])
			#print("One", disconnect, file=sys.stderr)
			break
		elif (links[1] == enemy) and (links[0] in gateways):
			disconnect = (enemy, links[0])
			#print("Two", disconnect, file = sys.stderr)
			break
		else:
			disconnect = link_list[-1:]
			#print("Three",disconnect, file=sys.stderr)
			break
			
	link_list.remove(disconnect) #< -------- This is the line
	print(str(disconnect[0]) + " " + str(disconnect[1]))
The error being thrown is:

code:
ValueError: list.remove(x): x not in list
at Answer.py in <module> on line 43
I went ahead and bolded the line that is having the issue.

It's the else branch: link_list[-1:] is a slice, so "disconnect" ends up as a one-element list of tuples instead of a tuple, and list.remove() can't find a list inside link_list, hence the ValueError. (That else also breaks out of the loop on the first link that doesn't match, so the rest never get checked.) Use link_list[-1] there, or better, only call remove() once you've actually found a gateway link.
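
One way to restructure the inside of the game loop so remove() always gets a tuple that's actually in the list (a sketch reusing your variable names):

Python code:
disconnect = None
for n1, n2 in link_list:
    if n1 == enemy and n2 in gateways:
        disconnect = (n1, n2)
        break
    elif n2 == enemy and n1 in gateways:
        disconnect = (n1, n2)
        break

if disconnect is None:
    disconnect = link_list[-1]   # a tuple, unlike link_list[-1:] which is a list

link_list.remove(disconnect)
print(str(disconnect[0]) + " " + str(disconnect[1]))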

vikingstrike
Sep 23, 2007

whats happening, captain

Thermopyle posted:

Because the vast majority of things people do with Python get no particular benefit out of numpy?

This. When I first started using python I had no needs outside of the standard library, but once I started doing my data analysis/cleaning/plotting in python I moved over to the Anaconda distribution, since pandas, numpy, and MPL are daily drivers for me now. One of the things I loved as a beginner to the language was how robust (at least to me) the standard library was in helping me do what I needed.

vikingstrike
Sep 23, 2007

whats happening, captain
I usually use a combination of requests and Beautiful Soup to download and scrape information from websites. Not sure if there is anything newer that has supplanted this. For working with tabular data I use pandas.
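
A minimal sketch of that combo (the URL and the tags pulled are just placeholders):

Python code:
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com")   # placeholder URL
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
for a in soup.find_all("a"):                  # e.g. grab every link on the page
    print(a.get("href"), a.get_text(strip=True))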

vikingstrike
Sep 23, 2007

whats happening, captain

Hughmoris posted:

I'm trying to explore and understand pandas and numpy. I have a data set that looks like this:
code:

UNIT        DISCHARGE DATE             DISCHARGE TO      DISCHARGE DELAY(hh:mm:ss)
CARD    10/01/2015 15:10:00    10/01/2015 06:51:42    8:18:18
NEUR    10/01/2015 10:15:00    10/01/2015 07:13:58    3:01:02
SURG    10/01/2015 09:30:00    10/01/2015 07:15:38    2:14:22
CARD    10/01/2015 11:23:00    10/01/2015 07:17:27    4:05:33
CARD    10/01/2015 15:20:00    10/01/2015 07:22:01    7:57:59
NEUR    10/01/2015 14:26:00    10/01/2015 07:23:12    7:02:48
...
...

Is there a simple way to get an average of "DISCHARGE DELAY" for each unit? I'd like to get the average "DISCHARGE DELAY" for each unit, then do a simple graph to display those averages.

Phone posting, but get discharge delay into numeric units and then use groupby:

df.groupby("unit", sort=True)["discharge_delay"].mean()

This will find the mean discharge delay for each unit in the data. You can then take the series that's returned and easily plot a bar graph with matplotlib or the built-in pandas plotting functions.
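
Fleshed out a bit, a sketch using a few rows from your sample (column names shortened here):

Python code:
import pandas as pd

# A few rows from the sample above; in practice this comes from however you load the file.
df = pd.DataFrame({
    "UNIT": ["CARD", "NEUR", "SURG", "CARD"],
    "DISCHARGE DELAY": ["8:18:18", "3:01:02", "2:14:22", "4:05:33"],
})

# hh:mm:ss string -> timedelta -> hours, so the mean is a plain number
df["delay_hours"] = pd.to_timedelta(df["DISCHARGE DELAY"]).dt.total_seconds() / 3600

means = df.groupby("UNIT")["delay_hours"].mean()
print(means)
means.plot(kind="bar")   # pandas' built-in plotting; or hand the series to matplotlib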

vikingstrike
Sep 23, 2007

whats happening, captain
Fwiw, Anaconda on Windows has been pretty hassle-free for me and comes with a bunch of stuff pre-installed (well, at least everything I usually use regularly).

vikingstrike
Sep 23, 2007

whats happening, captain

politicorific posted:

Jesus christ...

So I tried New York, but if i put in Taipei... no dice.

url = 'http://aqicn.org/taiwan/songshan'

Also, Anaconda/Spyder is really finicky about lxml... half the time it cannot find "import lxml.html"

I've never had any issues like this with Anaconda before. Do you have multiple python installations on the same machine?

vikingstrike
Sep 23, 2007

whats happening, captain
Is morphology unique to each word (this may be stupid if you know the problem)? If so, couldn't you just merge the two frames on morphology and then keep the intersection?

pandas.merge(df1, df2, on="morphology", how="inner")

vikingstrike
Sep 23, 2007

whats happening, captain
Yeah, string.whitespace is a longer string than just "\t"
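
For example:

Python code:
import string

print(repr(string.whitespace))     # ' \t\n\r\x0b\x0c' on CPython
print("\t" in string.whitespace)   # True, but it also covers spaces, newlines, etc.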

vikingstrike
Sep 23, 2007

whats happening, captain
You can use a list comprehension to do that cleaner. Or you could use a dict of lists.

vikingstrike
Sep 23, 2007

whats happening, captain

drainpipe posted:

I'm coding up the k-means clustering algorithm for a homework assignment. I'm given a list of data points (each with two coordinates) and I need to put them into k clusters. The way I'm doing it is to define k empty lists and fill them up with indices of my data points according to which cluster they belong to. I just realized that adding an extra column to the data array to indicate cluster assignment might work just as well.

edit: actually, that probably wouldn't be good since I need to calculate the means of a cluster and accessing the points of a cluster would be much harder if I had to read it from a field.

Have you looked into pandas? Stuff like this seems like it would be suited well. For example, to find the mean of a cluster is a simple groupby function call. Accessing the data points of a cluster is just indexing, etc.
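
A rough sketch of what I mean (random points and cluster labels standing in for your real assignment step):

Python code:
import numpy as np
import pandas as pd

# Random 2-D points with a cluster label column instead of k separate lists.
points = pd.DataFrame(np.random.rand(20, 2), columns=["x", "y"])
points["cluster"] = np.random.randint(0, 3, size=len(points))

centroids = points.groupby("cluster")[["x", "y"]].mean()   # mean of each cluster in one call
members = points[points["cluster"] == 0]                   # all points currently in cluster 0
print(centroids)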

vikingstrike
Sep 23, 2007

whats happening, captain

drainpipe posted:

Oh cool, I haven't looked at that. Are you talking about the DataFrames structure? I'm not familiar with Python and its packages so I'm using this class also as an opportunity to learn Python.

Yep.

vikingstrike
Sep 23, 2007

whats happening, captain
Go ahead and queue up the math/stats notation in code argument.

vikingstrike
Sep 23, 2007

whats happening, captain
I write terrible scientific/data analysis code, but I'd like to think this thread has helped me clean it up a lot over the last couple of years.

vikingstrike
Sep 23, 2007

whats happening, captain
Not sure what you're really looking for as the output goes, but is this similar to what you have in mind?

code:
# Split the ID on "_"; the second piece is NaN when there is no underscore.
df[['id_part1', 'id_part2']] = df.Unique_ID.str.split('_', expand=True)
# Count how many rows share each first piece.
df['num_id_part1'] = df.groupby('id_part1').id_part1.transform(len)
# First piece repeats with no second piece -> Parent; repeats with a second piece -> Child.
df.loc[(df.num_id_part1 > 1) & (df.id_part2.isnull()), 'ParentChild'] = 'Parent'
df.loc[(df.num_id_part1 > 1) & (df.id_part2.notnull()), 'ParentChild'] = 'Child'
# Drop the helper columns.
df.drop(['id_part1', 'id_part2', 'num_id_part1'], axis=1, inplace=True)
code:
In [31]: df = pd.DataFrame(['423424234', '423424234_234234234234234', '97734950430'], columns=['Unique_ID'])

In [32]: df
Out[32]: 
                   Unique_ID
0                  423424234
1  423424234_234234234234234
2                97734950430

In [33]: %paste
df[['id_part1', 'id_part2']] = df.Unique_ID.str.split('_', expand=True)
df['num_id_part1'] = df.groupby('id_part1').id_part1.transform(len)
df.loc[(df.num_id_part1 > 1)&(df.id_part2.isnull()), 'ParentChild'] = 'Parent'
df.loc[(df.num_id_part1 > 1)&(df.id_part2.notnull()), 'ParentChild'] = 'Child'
df.drop(['id_part1', 'id_part2', 'num_id_part1'], axis=1, inplace=True)

## -- End pasted text --

In [34]: df
Out[34]: 
                   Unique_ID ParentChild
0                  423424234      Parent
1  423424234_234234234234234       Child
2                97734950430         NaN


vikingstrike
Sep 23, 2007

whats happening, captain
I work with much larger data frames in pandas, and for most things it's quick enough with the built-in functions. Every now and then it will blow up and I'll get a bit more involved with coding what I need, but the developers have really done a nice job over the last year or so of making it faster and using Cython where possible.

vikingstrike
Sep 23, 2007

whats happening, captain
I'll have to look closer when I'm not on a phone, but you can definitely use & and | for row indexing in pandas. From a quick glance I'm pretty sure your .loc[] calls are slightly wrong.

Here's what you want
code:
# Start Current as a copy of New.
df.loc[:, 'Current'] = df.loc[:, 'New']
# Where New was missing but Old has a value, fall back to Old.
df.loc[(df.Current.isnull()) & (df.Old.notnull()), 'Current'] = df.loc[(df.Current.isnull()) & (df.Old.notnull()), 'Old']


vikingstrike
Sep 23, 2007

whats happening, captain

Eela6 posted:

You want to use numpy / MATLAB style logical indexing.

Remember not to use the bitwise operators like '&' unless you are actually working bitwise. Apparently this is a difference between numpy and pandas.

numpy has a number of formal logic operators that are what you want, called logical_and, logical_not, logical_xor, etc...

It's easiest to understand given an example. You might already know this, but it's always nice to have a refresher.

IN:
Python code:
A = np.array([2, 5, 8, 12, 20])
print(A)
between_twenty_and_three = np.logical_and(A>3, A<20)

print(between_twenty_and_three)
A[between_twenty_and_three] = 500

print(A)
OUT:
Python code:
[ 2  5  8 12 20]
[False  True  True  True False]
[  2 500 500 500  20]
Specifically, for your question:
IN:
Python code:
def update_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    df.Current = df.New
    df.Current[pd.isnull(df.New)] = df.Old[pd.isnull(df.New)]
    return df
    
def test_update_dataframe():
    df = pd.DataFrame([[1, 2,np.nan],
                   [3, 2,np.nan],
                   [7, np.nan,np.nan], 
                   [np.nan, 8,np.nan]],
                  columns=['Old', 'New', 'Current'])
    print('old')
    print(df)
    df = update_dataframe(df)
    print('new')
    print(df)    
    
test_update_dataframe()
OUT:
code:
old
   Old  New  Current
0  1.0  2.0      NaN
1  3.0  2.0      NaN
2  7.0  NaN      NaN
3  NaN  8.0      NaN
new
   Old  New  Current
0  1.0  2.0      2.0
1  3.0  2.0      2.0
2  7.0  NaN      7.0
3  NaN  8.0      8.0

BTW, touching on your post before the edit, I believe that this is something the pandas devs do on purpose to make it better align with other data analysis platforms.

vikingstrike
Sep 23, 2007

whats happening, captain
Yeah, just use pandas. CSV is usually more convenient though, IME, especially with larger files since you'll likely never want to open them ... in Excel. But we work in real office places (I'm assuming some of us at least) and you're sent what you're sent. I usually will take a raw file like that, clean/sanity check it, and then write an HDF file with the type info.
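
Roughly like this (filenames and the dtype/parse_dates arguments are placeholders; to_hdf needs PyTables installed):

Python code:
import pandas as pd

# Placeholder filenames; dtype/parse_dates is where the type info gets pinned down.
df = pd.read_csv("raw_export.csv", parse_dates=["date"], dtype={"id": str})
# ... clean / sanity check here ...
df.to_hdf("clean.h5", key="data", mode="w")   # needs PyTables installed

# Later reads come back with the dtypes intact:
df = pd.read_hdf("clean.h5", "data")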

vikingstrike
Sep 23, 2007

whats happening, captain
For scipy, I remember that building from source can be really picky about the compiler version and flags being used. Not sure if this helps you any. :/

vikingstrike
Sep 23, 2007

whats happening, captain
I'm trying to make the transition to python 3.6, and am wondering if there are any helpful resources online that summarize the big changes in python 3. I found the What's New section of the official documentation, but am looking for more of a Cliffs Notes version until I have time to finish reading through the docs.

vikingstrike
Sep 23, 2007

whats happening, captain



Thanks!

vikingstrike
Sep 23, 2007

whats happening, captain
If you are going to be doing data analysis in python, you'll want to learn and be comfortable with pandas and numpy. Given you are coming from R I would guess pandas is exactly what you are looking for.

import pandas as pd
frame = pd.read_csv("my.csv")  # path to whatever CSV you're loading

And go to town.

vikingstrike
Sep 23, 2007

whats happening, captain

Fusion Restaurant posted:

1. Do people use spaces or tabs to indent? I've noticed when copy pasting that sublime text is giving me tabs by default while Spyder is giving me spaces.

2. Also, a pandas Q which was a little too abstract for me to easily answer on stackexchange:
I have a bunch of csvs which I've downloaded from a site. Each is data for one week, all have the same columns, and I'd eventually like to concatenate them all into a big pandas dataframe.

Is the best way to do this to first clean each csv (which will involve dropping some columns, so making it smaller), and then merge?

Or should I just combine all the csvs into one giant csv, then turn it into a dataframe, and then clean that dataframe?

My concern is that the second method might leave me with a dataframe which is too big to really work with in memory (ie RAM) while I'm cleaning it and paring down the # of columns. Would the first approach actually be more memory efficient?

The last thing I guess I could do is to do the data cleaning on the csv's directly by editing the strings, or by reading them into a base python dictionary/list -- but I wasn't sure if that would actually save any memory? It would definitely be much more annoying.

e: Actually maybe what I really should get is a recommendation of a good guide to memory management in pandas. I'm pretty familiar with it in R, and how to efficiently do things w/in that language, but am totally lost in Python/pandas.

Why would it be that much different in python? Working on smaller chunks of the larger data set and then putting them together will of course lower the amount of RAM you need at any one time, so clean each csv first and then concatenate.
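
A sketch of the clean-then-concatenate approach (file pattern and column names are made up):

Python code:
import glob
import pandas as pd

keep = ["week", "store", "sales"]   # made-up column names

frames = []
for path in glob.glob("weekly_*.csv"):               # made-up file pattern
    frames.append(pd.read_csv(path, usecols=keep))   # only ever hold the pared-down columns

combined = pd.concat(frames, ignore_index=True)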

vikingstrike
Sep 23, 2007

whats happening, captain
Merge in pandas has a "copy" parameter that controls that exact behavior. You'll want to set it to False.
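
i.e. something like this (df1, df2, and "key" being whatever you're merging):

Python code:
merged = pd.merge(df1, df2, on="key", copy=False)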


vikingstrike
Sep 23, 2007

whats happening, captain
Just use the short blurbs y'all sent me when I asked a couple of weeks ago. Super short, no politics, etc.
