vikingstrike
Sep 23, 2007

whats happening, captain

Does mypy have support for pandas objects? So I could set the return/input type to be pd.Series or pd.DataFrame?


vikingstrike
Sep 23, 2007

whats happening, captain

huhu posted:

Would this be the best way to log errors on a script that I'm running?

code:
try:
    # Threading here for 3 different functions
except Exception as e:     
    logf = open("log.txt", "a")
    logf.write("Error: {} \n".format(str(e)))
    logf.close()
finally:
    pass

Check out the logging module that comes in the standard library. It offers a cleaner interface for this type of stuff.
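
Something like this would do it (just a sketch; the filename, the format string, and the run_my_threads() stand-in for your threading code are placeholders):

code:
import logging

# Send errors (and anything more severe) to log.txt with a timestamp
logging.basicConfig(
    filename="log.txt",
    level=logging.ERROR,
    format="%(asctime)s %(levelname)s: %(message)s",
)

try:
    run_my_threads()  # placeholder for your three threaded functions
except Exception:
    # logging.exception() records the message plus the full traceback
    logging.exception("Error while running threads")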

vikingstrike
Sep 23, 2007

whats happening, captain
No, turn away!

vikingstrike
Sep 23, 2007

whats happening, captain
If that substitution is all you need to do, can you not just use the replace() string method? Do you even need regex?
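
e.g. (made-up strings, but that's the whole idea):

code:
>>> "2.example.com".replace("example.com", "sub.example.com")
'2.sub.example.com'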

vikingstrike
Sep 23, 2007

whats happening, captain

Methanar posted:

The data is actually more structured like this. I gave the replace snippet a shot, but it didn't do anything.

code:

{ 
 "1": [
    "2.example.com",
    "2.example.com"
  ],
  "2": [
    "2.example.com"
  ],
  "3": [
    "3.example.com",
    "4.example.com"
  ]
} 

Now what I'm thinking is the json output is actually a string (I think), so maybe I can do my string manipulation after writing to json, but that's not going so well either.

code:
 
groups = {key: filterIP(list(set(items))) for (key, items) in groups.iteritems()}

s = self.json_format_dict(groups, pretty=True)
# print(s)

def filterSub(fullList2):
    return re.sub(r"example.com$", "sub.example.com", fullList2)

print(filterSub(s))

If you're iterating over a dictionary you'll need to update the comprehension accordingly. If your use case really is just the simple replace, I think regex is overkill here and makes for messier code for no reason.
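
A rough sketch of what I mean, assuming groups maps keys to lists of hostnames like in your snippet (items() here is the Python 3 spelling; iteritems() if you're still on 2):

code:
groups = {
    key: [host.replace("example.com", "sub.example.com") for host in items]
    for key, items in groups.items()
}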

vikingstrike
Sep 23, 2007

whats happening, captain
Thanks for the Fluent Python suggestion, thread. Started reading it last week and it's been really good.

vikingstrike
Sep 23, 2007

whats happening, captain

huhu posted:

In PyCharm, is there a way to run code and then leave some sort of a command line to continue playing around with it?


VisualStudio feels like the Photoshop of web development -- way too bloated. I realized that all the features I expected from PyCharm exist only in the professional version so now I've got myself a copy of that.

Highlight code, right click, execute selection in console.

vikingstrike
Sep 23, 2007

whats happening, captain
Take a look at pd.to_numeric()
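
Something like this (column name is just an example):

code:
frame['sales'] = pd.to_numeric(frame['sales'], errors='coerce')  # unparseable values become NaN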

vikingstrike
Sep 23, 2007

whats happening, captain
Didn't you say above that you told the installer not to add anything to your PATH? Are you sure the pip that's working in that instance isn't a carryover from a pre-Anaconda Python installation?

vikingstrike
Sep 23, 2007

whats happening, captain
^^ Yep that would definitely be the easiest way to troubleshoot that.

vikingstrike
Sep 23, 2007

whats happening, captain
I also bet he's going to need to do manipulation/cleaning of the data before plotting, which will be much easier in pandas, but sure, reinvent the wheel.

vikingstrike
Sep 23, 2007

whats happening, captain

Hughmoris posted:

Speaking of Pandas, I run into trouble when I need to create additional columns that are filled based on other column criteria. For example, if I have a CSV of:
code:
name,party_size,ticket_price
john,3,$14
sarah,1,$20
phil,6,$11
After I read that into Pandas, I then want to add two more columns. First column "More_Than_One" is Y/N based on party size being greater than 1. Next column is "Total_Cost" which is party_size * ticket_price.

How would I do something like that?

code:

import pandas as pd

frame = pd.read_csv('my_data.csv')

# ticket_price comes in as strings like '$14', so strip the '$' before doing math with it
frame['ticket_price'] = frame['ticket_price'].str.lstrip('$').astype(float)

frame = frame.assign(More_Than_One=(frame.party_size > 1).map({True: 'Y', False: 'N'}))
frame = frame.assign(Total_Cost=frame.party_size * frame.ticket_price)

vikingstrike
Sep 23, 2007

whats happening, captain

Jose posted:

Can anyone link a good guide for combining pandas and matplotlib? Basically how matplotlib usage differs if I'm using pandas data frames

Pandas has some plotting functions that will output matplotlib axes that you can tweak and save from there. plot() is the main interface, but some others like hist() and boxplot() have their own one-off methods. Like Cingulate said, seaborn is also a nice library that helps bridge these worlds, and it is DataFrame-aware. Although in either case you might have to use a bit of matplotlib to get things exactly the way you want.
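
For example, something in this direction (column name made up):

code:
import matplotlib.pyplot as plt

ax = frame['sales'].hist(bins=20)   # pandas hands back a matplotlib Axes
ax.set_title('Sales distribution')  # tweak it with regular matplotlib calls
ax.set_xlabel('sales')
plt.savefig('sales_hist.png')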

vikingstrike
Sep 23, 2007

whats happening, captain
I forget from your earlier posts, but have you posted any examples of code where you're getting stuck? This thread is quite willing to help people along if they know where to meet you.

vikingstrike
Sep 23, 2007

whats happening, captain
It can, sure. You'd just have an empty value for that level, but to me that would be pretty weird to work with. You could also define the level you're struggling to map with an appropriate value; for example, in your case that would be setting the value "Day" even though there is no variability for type 1. Other solutions would involve collapsing across levels through different naming, or not using the extra level at all.

vikingstrike
Sep 23, 2007

whats happening, captain
I saw the new PyPy has pandas and numpy support. Anyone know of any benchmarking that's been done?

vikingstrike
Sep 23, 2007

whats happening, captain
You can use the debugger to stop at a point in the code, hit the console button, and interact with the IPython window there (there's a variables viewer as well). Or you can highlight whatever code you want to run, then right click and choose "execute in python console" and it will run it in an open console (if you have one) or open a new one and execute it there.

I used to do exactly what you describe with pycharm + a terminal window but I do it all in pycharm now.

I know this is short but I’m phone posting but can provide more detail later if needed.

vikingstrike
Sep 23, 2007

whats happening, captain
You've always been able to highlight code and execute it in the built-in IPython terminal in PyCharm.

vikingstrike
Sep 23, 2007

whats happening, captain

Cingulate posted:

I found this very interesting: https://medium.com/dunder-data/python-for-data-analysis-a-critical-line-by-line-review-5d5678a4c203
A brutal review of Wes McKinney's book on pandas.

I think he nitpicks here and there, but overall I tend to agree with a lot of his points. I've never read his cookbook on pandas, but maybe I'll get work to buy it and I'll thumb through it. I've used pandas enough now that all I usually need is a quick scan of the docs to remember things. However, one thing he mentions early in his review that drives me up the loving wall is how the pandas devs use all of these random weird functions and methods that aren't really documented anywhere, and when you're learning the library it's really hard to figure out what the hell they're for.

vikingstrike
Sep 23, 2007

whats happening, captain

Hughmoris posted:

For the Pandas users out there, what type of things (if any) do you bounce back to Excel for?

Nothing other than writing out tabular summary views of the data I’m working with (which I may clean up the formatting here and there) that are usually sent to coworkers, or opening up Excel files I’ve been sent to see how they’re formatted, so I know how to import them.

vikingstrike
Sep 23, 2007

whats happening, captain
x = {} creates a dictionary; use x = set() if you want to create an empty set.

vikingstrike
Sep 23, 2007

whats happening, captain
If you need to pull the value of the next row into the current row, then create a new column with shift(-1)?
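
Something like (column names made up):

code:
frame['next_value'] = frame['value'].shift(-1)  # pulls each row's value from the row below it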

vikingstrike
Sep 23, 2007

whats happening, captain
itertuples() is much quicker than iterrows(), and might be a nice middle ground.
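
e.g. (column name is just an example):

code:
for row in frame.itertuples():
    print(row.Index, row.sales)  # namedtuple access instead of building a Series per row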

vikingstrike
Sep 23, 2007

whats happening, captain
pd.concat(your list of series, axis=1) should do it.
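
i.e. something like (series names are placeholders):

code:
combined = pd.concat([series_a, series_b, series_c], axis=1)  # each Series becomes a column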

vikingstrike
Sep 23, 2007

whats happening, captain

Jose Cuervo posted:

I am trying to understand what I am doing wrong with a Pandas merge of two data frames:

Python code:
merged_df = df_one.merge(df_two, how='left', on=['ID', 'Date'])
My understanding of this merge is that merged_df would have exactly the same number of rows as df_one because I am merging only on keys that exist in df_one (i.e., each ('ID', 'Date') tuple from df_one is iterated over and if that tuple exists as a row in df_two, then that data is copied over, and ('ID', 'Date') tuples in df_two that don't exist in df_one are ignored, correct?). However, when I run this code with my data, merged_df ends up having more rows than df_one has, and I don't know what I am doing wrong.

Is df_two unique on ID, Date?

vikingstrike
Sep 23, 2007

whats happening, captain

Jose Cuervo posted:

Are you asking if all the (ID, Date) tuple combinations in df_two are unique? If so, yes. df_two was generated using groupby where on=['ID', 'Date'].

Er, sorry, unique was a bad word. In df_two, does the same (ID, Date) pair ever appear in multiple rows?

vikingstrike
Sep 23, 2007

whats happening, captain

Jose Cuervo posted:

No, because df_two was generated using a groupby statement where on=['ID', 'Date'] - so every row in df_two corresponds to a unique ('ID', 'Date') tuple.

Try the indicator flag on the merge and then use it to see if it leads you to where the extra rows are coming from.
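
Using the names from your snippet, roughly:

code:
merged_df = df_one.merge(df_two, how='left', on=['ID', 'Date'], indicator=True)
print(merged_df['_merge'].value_counts())  # counts of 'both' vs 'left_only' rows

# also worth checking directly whether each key really is one row in df_two
print(df_two.duplicated(subset=['ID', 'Date']).sum())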

vikingstrike
Sep 23, 2007

whats happening, captain
Anybody had any issues with PyCharm skipping over breakpoints when debugging? My Google searches have failed me, and it's getting super annoying because I can't figure out how to replicate the issues.

vikingstrike
Sep 23, 2007

whats happening, captain
re.sub?

https://docs.python.org/3/library/re.html#re.sub
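
e.g.:

code:
>>> import re
>>> re.sub(r'\d+', 'N', 'abc123def456')
'abcNdefN'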

vikingstrike
Sep 23, 2007

whats happening, captain
What level of observation do you need the resultant data to be?

vikingstrike
Sep 23, 2007

whats happening, captain

Seventh Arrow posted:

Pretty detailed...this is the kind of analysis that I'll need to do on the data:
  • The president of company wants to understand which provinces and stores are performing well and how much are the top stores in each province performing compared with the average store of the province
  • The president further wants to know how customers in the loyalty program are performing compared to non-loyalty customers and what category of products is contributing to most of ACME’s sales
  • Determine the top 5 stores by province and top 10 product categories by department

Phone posting so I could be missing something, but those first three files look to have the same columns. If that’s the case, then concatenating the files together would work. Then you’d want to do two merges for the files below. One on store location key and the other on product key. If you need help, I can post pseudo code for you here in a bit when I can get back to a laptop. I would do all of this in pandas btw.

vikingstrike
Sep 23, 2007

whats happening, captain

Seventh Arrow posted:

If you could post an example of what you had in mind, it would be greatly appreciated. I have some other things that I can work on, so no rush.

Here's what I had in mind.

code:

import pandas as pd

# Take care of the first 3 CSV files
frame_a = pd.read_csv('csv_a.csv')  # I believe that 'NA' is already flagged as a missing value, so you should have it encoded properly. If not, look at na_values and na_filter parameters.
frame_b = pd.read_csv('csv_b.csv')
frame_c = pd.read_csv('csv_c.csv')
frame = pd.concat([frame_a, frame_b, frame_c])

# Now, merge in the location data
location_frame = pd.read_csv('locations.csv')
frame = frame.merge(location_frame, on='store_location_key', how='left')  # Want to do left joins here so as not to destroy any data from the A, B, and C files

# And the product data
product_frame = pd.read_csv('products.csv')
frame = frame.merge(product_frame, on='product_key', how='left')  # Same idea as above

# To fill 0s in where you need to for missing values
cols_to_fill_with_zero = ['here', 'are', 'my', 'column', 'names']
frame.loc[:, cols_to_fill_with_zerp'] = frame.loc[:, cols_to_fill_with_zerp'].fillna(0)

Obviously, I have no idea what your raw data actually look like, but for the first thing you mentioned, you could do something like:

code:
num_trans_per_store = (
    frame
    .groupby('store_location_key', as_index=False)  # For every store in the data
    .agg({'trans_id': 'nunique', 'region': 'first'})  # Tell me how many unique transaction ids there were, and what region they are in
)
num_trans_per_store = num_trans_per_store.assign(region_avg=num_trans_per_store.groupby('region').trans_id.transform('mean'))  # For each region, the average transaction count of its stores
num_trans_per_store = num_trans_per_store.assign(store_diff_to_region=num_trans_per_store.trans_id - num_trans_per_store.region_avg)  # Each store's transaction count relative to its region's average

vikingstrike
Sep 23, 2007

whats happening, captain

Seventh Arrow posted:

That's great, thanks a lot. So I guess "concat" was what I was looking for when it comes to the similar csv files. Does it just automatically look at the column names and sort accordingly?

It's aligning the DataFrames along the column index, which in this case means their names. So it doesn't matter what position a column is in, it matters that the columns are labeled the same. The default is axis=0, which stacks the DataFrames on top of each other row-wise, with the column names telling pandas which data goes where; you could set axis=1 instead and think of the same exercise along the row index, appending side by side. If one DataFrame has a column the others don't, pandas will create that column in the result and assign missing values to the rows that came from frames without it.
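
A tiny illustration of that last point (made-up frames):

code:
a = pd.DataFrame({'store': [1, 2], 'sales': [10, 20]})
b = pd.DataFrame({'store': [3], 'sales': [30], 'units': [5]})
print(pd.concat([a, b]))  # 'units' is NaN for the rows that came from a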

quote:

Another bit of interest is the "frame.loc" line...so if I have multiple columns what would the format be like? Maybe something like:

code:
frame.loc[:, sales, units, etc_etc'] = frame.loc[:, sales, units, etc_etc'].fillna(0)
?

Whoops, there was a typo in my original code. It should read:

code:
# To fill 0s in where you need to for missing values
cols_to_fill_with_zero = ['here', 'are', 'my', 'column', 'names']
frame.loc[:, cols_to_fill_with_zero] = frame.loc[:, cols_to_fill_with_zero].fillna(0)
You are passing a list of column names: ['colA', 'colB', 'colC']. If you were only selecting a single column, you could pass just the name: .loc[:, 'colA']. Using .loc[row_indexer, columns_indexer] is important to pandas because it allows you to index DataFrames pretty flexibly. In my code, we are using .loc[:, cols_to_fill_with_zero] because we want to select all rows for these columns (cols_to_fill_with_zero).

To give you an idea of how you can build this into more complex expressions: say that for the region Canada, for all stores with a location id over 1000, I want to set the missing values in the 'price' column to 'GOON'. You could do:

code:
row_idx = (
    (frame.region == 'Canada') &
    (frame.price.isnull()) &
    (frame.store_location_key > 1000)
)
frame.loc[row_idx, 'price'] = 'GOON'

vikingstrike
Sep 23, 2007

whats happening, captain
Is there a reason you're using pyspark? Everything you mention can be done with pandas directly.

edit:

pandas automatically recognizes sales as a float column. See below.

code:
Python 3.6.3 |Anaconda, Inc.| (default, Oct  6 2017, 12:04:38)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: frame = pd.read_csv('trans_fact_4.csv')

In [3]: frame.head()
Out[3]:
   collector_key    trans_dt  store_location_key   product_key  sales  units  \
0           -1.0   6/26/2015                8142  4.319417e+09   9.42      1
1           -1.0  10/25/2015                8142  6.210700e+09  24.90      1
2           -1.0   9/18/2015                8142  5.873833e+09  12.09      1
3           -1.0   9/14/2015                8142  7.710581e+10  20.45      1
4           -1.0   4/18/2015                8142  5.610008e+09  10.31      1

      trans_key
0  1.694550e+25
1  3.400180e+25
2  1.727480e+25
3  4.145280e+24
4  2.641580e+25

In [4]: frame.dtypes
Out[4]:
collector_key         float64
trans_dt               object
store_location_key      int64
product_key           float64
sales                 float64
units                   int64
trans_key             float64
dtype: object

vikingstrike fucked around with this message at 19:48 on Feb 27, 2018

vikingstrike
Sep 23, 2007

whats happening, captain

Seventh Arrow posted:

Yes, I made a tiny little inconspicuous link in my post...here's the whole hog:

http://www.vaughn-s.net/hadoop/trans_fact_4.csv


For this assignment, I have to do it in pyspark - however, pyspark just uses all the same syntax as vanilla python. I just tried your examples in Spark and got the same results.
I guess the only question is whether I have to set up the dataframe using that convoluted setup that I used before, but I guess I don't!

If you need to create a DataFrame, and pyspark has a DataFrame creator function that gives you the desired output, I'm not sure why you'd try to roll your own. Turning CSVs into DataFrames is some of the most basic functionality of a library like this.

vikingstrike
Sep 23, 2007

whats happening, captain
One is for indexing, one is for iterating.

vikingstrike
Sep 23, 2007

whats happening, captain
What type of reports are you thinking? Out of what you describe, the logical addition would be matplotlib/seaborn to plot figures in python.

vikingstrike
Sep 23, 2007

whats happening, captain
You can send email using Python and write it so that the email body is inline. It's been a while since I've done this, but it should be easy to automate: data cleaning -> figure generation -> email.
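
Roughly like this with the standard library (addresses, server, and filenames are all placeholders):

code:
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg['Subject'] = 'Daily report'
msg['From'] = 'me@example.com'
msg['To'] = 'boss@example.com'
msg.set_content('Numbers attached.')

# attach whatever figure you generated earlier in the pipeline
with open('report.png', 'rb') as f:
    msg.add_attachment(f.read(), maintype='image', subtype='png', filename='report.png')

with smtplib.SMTP('smtp.example.com') as server:
    server.send_message(msg)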

vikingstrike
Sep 23, 2007

whats happening, captain
Look at this loop and see if you can spot where you're tripping up:

code:
for i, rows in data[['collector_key', 'sales']].iterrows():
    collector_key = data[['collector_key']].astype(float)
    sales = data[['sales']]

    print("Adding data:", collector_key, sales)

    table.put_item(
        Item={
            'collector_key': collector_key,
            'sales': sales,
        }
    )
 


vikingstrike
Sep 23, 2007

whats happening, captain
Nope. It has to do with how you are first assigning collector key and sales.
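
If it helps, the shape of the fix is to pull the values off the row the loop hands you, instead of re-slicing the whole frame on every pass. Something like:

code:
for i, row in data[['collector_key', 'sales']].iterrows():
    collector_key = float(row['collector_key'])  # this row's value, not a whole DataFrame
    sales = row['sales']

    print("Adding data:", collector_key, sales)

    table.put_item(
        Item={
            'collector_key': collector_key,
            'sales': sales,
        }
    )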
