Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

I usually deploy to Heroku where you spin up dynos for multiple workers. No messing with systemd there so I can't help you with that.

I don't think there'd be anything specific to RQ...it's just another process for systemd to manage.


mr_package
Jun 13, 2000
Yeah migrating all this stuff to containers is on my to-do list for sure, but right now it's old school Linux admin style.

Boris Galerkin
Dec 17, 2011

I don't understand why I can't harass people online. Seriously, somebody please explain why I shouldn't be allowed to stalk others on social media!

Cingulate posted:

You can try this stackexchange answer:
https://stackoverflow.com/questions/6309472/matplotlib-can-i-create-axessubplot-objects-then-add-them-to-a-figure-instance/46906599#46906599

But my suggestion would be to write a function that creates your axis, and give it an axis parameter. Then you call it once for its own figure, and another time for the joint figure. Much less awkward.

Yeah, I saw that post and I think I want to do the same thing he wanted to do, but in an update he said it's not possible. I was just hoping he was wrong and/or that newer versions made it possible, but I guess not.


Anyway on an unrelated topic, does anyone use Jupyter notebooks? I feel like Jupyter notebooks are one of these things that I've always heard about and people rave about it, like Docker, but I don't really "get" it so I'm having a hard time seeing use cases for them. But I'll admit that once I actually sat down and understood Docker sometime last year I went from "meh okay" to "holy poo poo, this is awesome" real fast. I'm hoping the same thing happens with Jupyter notebooks.

Cingulate
Oct 23, 2012

by Fluffdaddy
Generally, if you want to do something in Python and it’s awkward, more often than not you want to do the wrong thing. Just try it with a function, like I said.
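
Roughly what I mean (names and data made up, so treat it as a sketch):

Python code:

import matplotlib.pyplot as plt
import numpy as np

def plot_thing(data, ax=None):
    # If no axes are passed in, make a standalone figure
    if ax is None:
        fig, ax = plt.subplots()
    ax.plot(data)
    return ax

data = np.random.randn(100)

# once on its own figure
plot_thing(data)

# and again as part of a joint figure
fig, (ax1, ax2) = plt.subplots(1, 2)
plot_thing(data, ax=ax1)
plot_thing(data ** 2, ax=ax2)
plt.show()

The function doesn't care whether the axes came from its own figure or someone else's.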

Everyone and their moms uses notebooks. When I want to multiply two numbers and one of them has more than 3 digits, I open a notebook.
That is, if you analyse data.

Boris Galerkin
Dec 17, 2011

I don't understand why I can't harass people online. Seriously, somebody please explain why I shouldn't be allowed to stalk others on social media!

Cingulate posted:

Generally, if you want to do something in Python and it’s awkward, more often than not you want to do the wrong thing. Just try it with a function, like I said.

Everyone and their moms uses notebooks. When I want to multiply two numbers and one of them has more than 3 digits, I open a notebook.
That is, if you analyse data.

About the notebooks: I guess I'm asking why you would do that? What I would do, if I really needed to do this, is just open up a new blank script in PyCharm, type in my multiplication and hit f5 or whatever the hotkey is to run the script. Like I said, Jupyter Notebooks seems like a great tool but I just don't get it. Most of the stuff I'm finding online reminds me of when I didn't get Docker: lots of people saying how great they are but nobody really "showing" how great they are.

Cingulate
Oct 23, 2012

by Fluffdaddy
The basic math thing was hyperbole.

Well, what is it that you need to do? Complex plotting things, with multiple open figures, are a great scenario for notebooks. If you frequently go back to the data itself, even better.

In other contexts, other tools will be superior. It depends.

Boris Galerkin
Dec 17, 2011

I don't understand why I can't harass people online. Seriously, somebody please explain why I shouldn't be allowed to stalk others on social media!
I guess I’m asking a question that can’t really be answered so never mind, I’ll just keep using it a bit more and see for myself. I had to make some new plots today so I used a notebook for that. I found that using it in the browser wasn’t really nice because I’m just so used to PyCharm’s autocompletion for everything. PyCharm lets me attach a notebook to a running server too, so that worked better and it gave me autocomplete. But still, I coulda done the same thing with just a script.

e: Whelp, I just searched for something and this link came up: https://stackoverflow.com/a/38192558 and that's actually really insightful for me. I didn't think of using notebooks like that.

Boris Galerkin fucked around with this message at 19:20 on Feb 22, 2018

Ghost of Reagan Past
Oct 7, 2003

rock and roll fun

Boris Galerkin posted:

Anyway on an unrelated topic, does anyone use Jupyter notebooks? I feel like Jupyter notebooks are one of these things that I've always heard about and people rave about it, like Docker, but I don't really "get" it so I'm having a hard time seeing use cases for them. But I'll admit that once I actually sat down and understood Docker sometime last year I went from "meh okay" to "holy poo poo, this is awesome" real fast. I'm hoping the same thing happens with Jupyter notebooks.
They're great for manipulation of data. I can make charts, do exploratory analysis, write code that I'll refactor later, throwaway experiments, etc. Plus, include prose and other stuff to illuminate what I'm actually doing.

They're less good for most other coding things.

There are obviously holdouts but if you work with data in Python, you're going to use notebooks.

QuarkJets
Sep 8, 2008

Boris Galerkin posted:

About the notebooks: I guess I'm asking why you would do that? What I would do, if I really needed to do this, is just open up a new blank script in PyCharm, type in my multiplication and hit f5 or whatever the hotkey is to run the script. Like I said, Jupyter Notebooks seems like a great tool but I just don't get it. Most of the stuff I'm finding online reminds me of when I didn't get Docker: lots of people saying how great they are but nobody really "showing" how great they are.

I think if you're already a PyCharm user then the only other reason to open a notebook is if you plan to share the results (not just code) with someone else.

Sad Panda
Sep 22, 2004

I'm a Sad Panda.
I'm on OS X. I installed Anaconda thinking it'd be good for keeping everything I install organised. I'm trying to install pygame and it's failing all over the shop.

conda install pygame finds nothing, so I went and found https://anaconda.org/search?q=platform%3Aosx-64+pygame
Tried each of those and it says...

quote:

OBOW:~ obow$ conda install -c kne pygame
Solving environment: failed

UnsatisfiableError: The following specifications were found to be in conflict:
- pygame
- traitlets
Use "conda info <package>" to see the dependencies for each package.

OBOW:~ obow$ conda install -c quasiben pygame
Solving environment: failed

UnsatisfiableError: The following specifications were found to be in conflict:
- pygame
- traitlets
Use "conda info <package>" to see the dependencies for each package.

OBOW:~ obow$ conda install -c maceo57 pygame
Solving environment: failed

UnsatisfiableError: The following specifications were found to be in conflict:
- pygame
- traitlets
Use "conda info <package>" to see the dependencies for each package.

OBOW:~ obow$ conda install -c derickl pygame
Solving environment: failed

UnsatisfiableError: The following specifications were found to be in conflict:
- pygame
- traitlets
Use "conda info <package>" to see the dependencies for each package.

I can install it with pip3 but then that's not found in Conda.

https://conda.io/docs/user-guide/tasks/manage-pkgs.html suggests using an env but...

quote:

OBOW:~ obow$ source activate base
(base) OBOW:~ obow$ pip3 install pygame
Requirement already satisfied: pygame in /usr/local/lib/python3.6/site-packages
(base) OBOW:~ obow$ python3
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 12:04:33)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pygame
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pygame'

SurgicalOntologist
Jun 17, 2004

After you activate, use "pip" instead of "pip3". Activating conda environments points pip to the right executable, but doesn't bother with pip3, which stays connected to your system Python.

Sad Panda
Sep 22, 2004

I'm a Sad Panda.

SurgicalOntologist posted:

After you activate, use "pip" instead of "pip3". Activating conda environments points pip to the right executable, but doesn't bother with pip3, which stays connected to your system Python.

Thank you!

quote:

OBOW:alien_invasion obow$ conda activate base
(base) OBOW:alien_invasion obow$ pip install pygame
Collecting pygame
Using cached pygame-1.9.3-cp36-cp36m-macosx_10_9_intel.whl
Installing collected packages: pygame
Successfully installed pygame-1.9.3
(base) OBOW:alien_invasion obow$ python3
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 12:04:33)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pygame
>>>

SurgicalOntologist
Jun 17, 2004

Also, although it appears that python3 points to the right place, I would just use the plain python executable. Once you've activated an env, further disambiguation is not necessary. Executables like python2, python3, pip3, etc. are awkward workarounds for working with multiple pythons without environments. With environments you don't need to worry about that and should use the canonical version of everything, IMO.

Seventh Arrow
Jan 26, 2005

This is for pySpark, but the syntax should still be the same. I just realized that I know how to search for text, but not replace it. I need to look through a bunch of spreadsheets and replace any "NA" values with zeroes. So I thought about maybe looking for them with a regex:

code:
from pyspark import SparkContext
import re

text = sc.textFile('/documents/spreadsheets/*')
text_f = text.filter(lambda line: re.search(r'insert_the_regex_that_I_havent_worked_out_yet', line))
It seems like the next step should be obvious, but I'm drawing a blank. I'm aware of 'na.fill' but it seems to me like it looks specifically for 'null'.

Seventh Arrow
Jan 26, 2005

Actually, maybe map will do it...I just need to figure out the syntax. Maybe:

code:
from pyspark import SparkContext
import re

text = sc.textFile('/documents/spreadsheets/*')
text_f = text.filter(lambda line: re.search(r'insert_the_regex_that_I_havent_worked_out_yet', line)).map(lambda x: 0)

vikingstrike
Sep 23, 2007

whats happening, captain
re.sub?

https://docs.python.org/3/library/re.html#re.sub

baka kaba
Jul 19, 2003

PLEASE ASK ME, THE SELF-PROFESSED NO #1 PAUL CATTERMOLE FAN IN THE SOMETHING AWFUL S-CLUB 7 MEGATHREAD, TO NAME A SINGLE SONG BY HIS EXCELLENT NU-METAL SIDE PROJECT, SKUA, AND IF I CAN'T PLEASE TELL ME TO
EAT SHIT

map is for transforming a bunch of things (it maps values to other values, basically). So if you have a set of elements, but you want to change the NAs to 0s, your mapping function wants to output a 0 if it gets one of those, and just pass everything else through as-is (or map the value to itself, if you like)

Python code:

lambda x: '0' if x == 'NA' else x

that way the sequence of elements you get out is the same as what you put in - you've just replaced some of them with different values.

What your code does is filter out all the non-NA lines (right?), leaving you with only a bunch of NAs, and then you turn those into a bunch of 0s instead - which isn't very useful! You want to keep all the elements, but use map to selectively change some as they pass through

Using a regex find and replace might be a lot better anyway, I just wanted to point out what the functional stuff is about
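
Quick toy example of the difference, assuming you've got your SparkContext as sc like in your snippet:

Python code:

rdd = sc.parallelize(['12', 'NA', '7', 'NA'])

# filter only keeps the matching elements - everything else is gone
rdd.filter(lambda x: x == 'NA').collect()              # ['NA', 'NA']

# map keeps every element, transforming the ones you care about
rdd.map(lambda x: '0' if x == 'NA' else x).collect()   # ['12', '0', '7', '0']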

baka kaba fucked around with this message at 02:43 on Feb 26, 2018

Seventh Arrow
Jan 26, 2005

Yeah, thanks for that. I took a closer look and I might have to convert everything to a dataframe anyways. Also I noticed that out of the 10 csv files I need to combine, they all have columns that are set up differently than the others. This might take a lot of joins to get it to work.

Seventh Arrow
Jan 26, 2005

Ok so now I'm wondering if using join will get all of these csv files into a nice little pile. I need to combine multiple csv files into one object (a dataframe, I assume) but they all have mismatched columns, like so:

CSV A

store_location_key | product_key | collector_key | trans_dt | sales | units | trans_key

CSV B

collector_key | trans_dt | store_location_key | product_key | sales | units | trans_key

CSV C

collector_key | trans_dt | store_location_key | product_key | sales | units | trans_id

On top of that, I need these to match with two additional csv files that have a matching column:

Location CSV

store_location_key | region | province | city | postal_code | banner | store_num

Product CSV

product_key | sku | item_name | item_description | department | category

The data types are all consistent, i.e., sales is always float, store_location_key is always int, etc. Even if I convert each csv to a dataframe first, I'm not sure that a join would work (except for the last two) because of the way that the columns need to match up. Any ideas?

vikingstrike
Sep 23, 2007

whats happening, captain
What level of observation do you need the resultant data to be?

Seventh Arrow
Jan 26, 2005

Pretty detailed...this is the kind of analysis that I'll need to do on the data:
  • The president of the company wants to understand which provinces and stores are performing well, and how the top stores in each province are performing compared with the average store in the province
  • The president further wants to know how customers in the loyalty program are performing compared to non-loyalty customers, and which category of products contributes most to ACME's sales
  • Determine the top 5 stores by province and the top 10 product categories by department

vikingstrike
Sep 23, 2007

whats happening, captain

Seventh Arrow posted:

Pretty detailed...this is the kind of analysis that I'll need to do on the data:
  • The president of the company wants to understand which provinces and stores are performing well, and how the top stores in each province are performing compared with the average store in the province
  • The president further wants to know how customers in the loyalty program are performing compared to non-loyalty customers, and which category of products contributes most to ACME's sales
  • Determine the top 5 stores by province and the top 10 product categories by department

Phone posting so I could be missing something, but those first three files look to have the same columns. If that’s the case, then concatenating the files together would work. Then you’d want to do two merges for the files below. One on store location key and the other on product key. If you need help, I can post pseudo code for you here in a bit when I can get back to a laptop. I would do all of this in pandas btw.

Seventh Arrow
Jan 26, 2005

If you could post an example of what you had in mind, it would be greatly appreciated. I have some other things that I can work on, so no rush.

vikingstrike
Sep 23, 2007

whats happening, captain

Seventh Arrow posted:

If you could post an example of what you had in mind, it would be greatly appreciated. I have some other things that I can work on, so no rush.

Here's what I had in mind.

code:

import pandas as pd

# Take care of the first 3 CSV files
frame_a = pd.read_csv('csv_a.csv')  # I believe that 'NA' is already flagged as a missing value, so you should have it encoded properly. If not, look at na_values and na_filter parameters.
frame_b = pd.read_csv('csv_b.csv')
frame_c = pd.read_csv('csv_c.csv')
frame = pd.concat([frame_a, frame_b, frame_c])

# Now, merge in the location data
location_frame = pd.read_csv('locations.csv')
frame = frame.merge(location_frame, on='store_location_key', how='left')  # Want to do left joins here so as not to destroy any data from the A, B, and C files

# And the product data
product_frame = pd.read_csv('products.csv')
frame = frame.merge(product_frame, on='product_key', how='left')  # Same idea as above

# To fill 0s in where you need to for missing values
cols_to_fill_with_zero = ['here', 'are', 'my', 'column', 'names']
frame.loc[:, cols_to_fill_with_zerp'] = frame.loc[:, cols_to_fill_with_zerp'].fillna(0)

Obviously, I have no idea what your raw data actually look like, but for the first thing you mentioned, you could do something like:

code:
num_trans_per_store = (
    frame
    .groupby('store_location_key', as_index=False)  # For every store in the data
    .agg({'trans_id': 'nunique', 'region': 'first'})  # Tell me how many unique transaction ids there were, and what region they are in
)
num_trans_per_store = num_trans_per_store.assign(region_avg=num_trans_per_store.groupby('region').trans_id.transform('mean'))  # For each region calculate the average number of transactions of its stores
num_trans_per_store = num_trans_per_store.assign(store_diff_to_region=num_trans_per_store.trans_id - num_trans_per_store.region_avg)  # Calculate the difference in transaction of each store relative to its region's average

Seventh Arrow
Jan 26, 2005

vikingstrike posted:

Here's what I had in mind.

code:

import pandas as pd

# Take care of the first 3 CSV files
frame_a = pd.read_csv('csv_a.csv')  # I believe that 'NA' is already flagged as a missing value, so you should have it encoded properly. If not, look at na_values and na_filter parameters.
frame_b = pd.read_csv('csv_b.csv')
frame_c = pd.read_csv('csv_c.csv')
frame = pd.concat([frame_a, frame_b, frame_c])

# Now, merge in the location data
location_frame = pd.read_csv('locations.csv')
frame = frame.merge(location_frame, on='store_location_key', how='left')  # Want to do left joins here so as not to destroy any data from the A, B, and C files

# And the product data
product_frame = pd.read_csv('products.csv')
frame = frame.merge(product_frame, on='product_key', how='left')  # Same idea as above

# To fill 0s in where you need to for missing values
cols_to_fill_with_zero = ['here', 'are', 'my', 'column', 'names']
frame.loc[:, cols_to_fill_with_zerp'] = frame.loc[:, cols_to_fill_with_zerp'].fillna(0)

Obviously, I have no idea what your raw data actually look like, but for the first thing you mentioned, you could do something like:

code:
num_trans_per_store = (
    frame
    .groupby('store_location_key', as_index=False)  # For every store in the data
    .agg({'trans_id': 'nunique', 'region': 'first'})  # Tell me how many unique transaction ids there were, and what region they are in
)
num_trans_per_store = num_trans_per_store.assign(region_avg=num_trans_per_store.groupby('region').trans_id.transform('mean'))  # For each region calculate the average number of transactions of its stores
num_trans_per_store = num_trans_per_store.assign(store_diff_to_region=num_trans_per_store.trans_id - num_trans_per_store.region_avg)  # Calculate the difference in transaction of each store relative to its region's average


That's great, thanks a lot. So I guess "concat" was what I was looking for when it comes to the similar csv files. Does it just automatically look at the column names and sort accordingly?

Another bit of interest is the "frame.loc" line...so if I have multiple columns what would the format be like? Maybe something like:

code:
frame.loc[:, sales, units, etc_etc'] = frame.loc[:, sales, units, etc_etc'].fillna(0)
?

vikingstrike
Sep 23, 2007

whats happening, captain

Seventh Arrow posted:

That's great, thanks a lot. So I guess "concat" was what I was looking for when it comes to the similar csv files. Does it just automatically look at the column names and sort accordingly?

It's aligning the DataFrames along the column index, which in this case is their names. So it doesn't matter what their position is, it matters that they are labeled the same. The default is axis=0, which stacks the DataFrames on top of each other, with the column names telling pandas which data goes where; you could set axis=1 instead and think of the same exercise along the row index. If one DataFrame has a column the others don't, pandas will create a new column and assign missing values to the rows/pieces that didn't contain that variable.
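
Tiny made-up example of that alignment:

code:

import pandas as pd

a = pd.DataFrame({'store': [1, 2], 'sales': [9.50, 3.25]})
b = pd.DataFrame({'sales': [4.00], 'store': [3], 'units': [2]})  # different column order, one extra column

combined = pd.concat([a, b])
# Columns line up by name, not position; 'units' is NaN for the rows that came from a
print(combined)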

quote:

Another bit of interest is the "frame.loc" line...so if I have multiple columns what would the format be like? Maybe something like:

code:
frame.loc[:, sales, units, etc_etc'] = frame.loc[:, sales, units, etc_etc'].fillna(0)
?

Whoops, there was a typo in my original code. It should read:

code:
# To fill 0s in where you need to for missing values
cols_to_fill_with_zero = ['here', 'are', 'my', 'column', 'names']
frame.loc[:, cols_to_fill_with_zero] = frame.loc[:, cols_to_fill_with_zero].fillna(0)
You are passing a list of column names: ['colA', 'colB', 'colC']. If you were only selecting a single column, you could pass just the name: .loc[:, 'colA']. Using .loc[row_indexer, columns_indexer] is important to pandas because it allows you to index DataFrames pretty flexibly. In my code, we are using .loc[:, cols_to_fill_with_zero] because we want to select all rows for these columns (cols_to_fill_with_zero).

To give you an idea of how you can build this into more complex expressions: say that for region Canada, for all stores with a location id over 1000, I want to set the missing values in column 'price' to 'GOON'. You could do:

code:
row_idx = (
    (frame.region == 'Canada') &
    (frame.price.isnull()) &
    (frame.store_location_key > 1000)
)
frame.loc[row_idx, 'price'] = 'GOON'

Seventh Arrow
Jan 26, 2005

Ok this is the last time I make a nuisance of myself with this dataframe/csv stuff (I hope). There are a few csv files in the set that do not take kindly to having the 'sales' column classified with a 'float' data type. This is the code that I'm using to convert the files to dataframes...it's a bit convoluted, but it strips the header and uses its data to form the schema:

code:
import pandas as pd
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc)

text_b = sc.textFile('/home/seventh_arrow/Documents/B/file_4.csv')
text_b.count()
header_b = text_b.first()
fields_b = [StructField(field_name, StringType(), True) for field_name in header_b.split(',')]
fields_b[0].dataType = StringType()
fields_b[1].dataType = StringType()
fields_b[2].dataType = IntegerType()
fields_b[3].dataType = StringType()
fields_b[4].dataType = FloatType()
fields_b[5].dataType = IntegerType()
fields_b[6].dataType = StringType()
schema_b = StructType(fields_b)
b_header = text_b.filter(lambda l: "collector_key" in l)
b_noheader = text_b.subtract(b_header)
b_noheader.count()
b_temp = b_noheader.map(lambda k: k.split(",")).map(lambda x: (x[0], x[1], int(x[2]), x[3], float(x[4]), int(x[5]), x[6] ))
b_temp.top(2)
b_df = sqlContext.createDataFrame(b_temp, schema_b)
b_df.head(10)
When I run "b_temp.top(2)", I get the error "ValueError: could not convert string to float".

It works A-OK with the first three csv files, so I'm thinking there has to be something within file_4 that doesn't conform to float specifics...maybe there's some extra whitespace or something (if you're really curious, you can see the file here: http://www.vaughn-s.net/hadoop/trans_fact_4.csv). This column needs calculations done on it so I can't cheat and classify it as a string.

So I guess I'm asking:

Is there a way to scan the csv for the problem cell(s)? Or to maybe make the whole column conform to float?

I saw on one webpage that you can normalize the data types with

code:
data = pd.read_csv('/home/seventh_arrow/Documents/B/file_4.csv', dtype={'sales': float})
But it doesn't really work with the way the rest of the code above works. Sorry, I'm bad at python :eng99:

Cingulate
Oct 23, 2012

by Fluffdaddy
Can you show what the file looks like? I am pretty optimistic this can be solved with 3 lines of pandas.

vikingstrike
Sep 23, 2007

whats happening, captain
Is there a reason you're using pyspark? Everything you mention can be done with pandas directly.

edit:

pandas automatically recognizes sales as a float column. See below.

code:
Python 3.6.3 |Anaconda, Inc.| (default, Oct  6 2017, 12:04:38)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: frame = pd.read_csv('trans_fact_4.csv')

In [3]: frame.head()
Out[3]:
   collector_key    trans_dt  store_location_key   product_key  sales  units  \
0           -1.0   6/26/2015                8142  4.319417e+09   9.42      1
1           -1.0  10/25/2015                8142  6.210700e+09  24.90      1
2           -1.0   9/18/2015                8142  5.873833e+09  12.09      1
3           -1.0   9/14/2015                8142  7.710581e+10  20.45      1
4           -1.0   4/18/2015                8142  5.610008e+09  10.31      1

      trans_key
0  1.694550e+25
1  3.400180e+25
2  1.727480e+25
3  4.145280e+24
4  2.641580e+25

In [4]: frame.dtypes
Out[4]:
collector_key         float64
trans_dt               object
store_location_key      int64
product_key           float64
sales                 float64
units                   int64
trans_key             float64
dtype: object

vikingstrike fucked around with this message at 19:48 on Feb 27, 2018

Seventh Arrow
Jan 26, 2005

Cingulate posted:

Can you show what the file looks like? I am pretty optimistic this can be solved with 3 lines of pandas.

Yes, I made a tiny little inconspicuous link in my post...here's the whole hog:

http://www.vaughn-s.net/hadoop/trans_fact_4.csv

vikingstrike posted:

Is there a reason you're using pyspark? Everything you mention can be done with pandas directly.

edit:

pandas automatically recognizes sales as a float column. See below.

code:
Python 3.6.3 |Anaconda, Inc.| (default, Oct  6 2017, 12:04:38)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: frame = pd.read_csv('trans_fact_4.csv')

In [3]: frame.head()
Out[3]:
   collector_key    trans_dt  store_location_key   product_key  sales  units  \
0           -1.0   6/26/2015                8142  4.319417e+09   9.42      1
1           -1.0  10/25/2015                8142  6.210700e+09  24.90      1
2           -1.0   9/18/2015                8142  5.873833e+09  12.09      1
3           -1.0   9/14/2015                8142  7.710581e+10  20.45      1
4           -1.0   4/18/2015                8142  5.610008e+09  10.31      1

      trans_key
0  1.694550e+25
1  3.400180e+25
2  1.727480e+25
3  4.145280e+24
4  2.641580e+25

In [4]: frame.dtypes
Out[4]:
collector_key         float64
trans_dt               object
store_location_key      int64
product_key           float64
sales                 float64
units                   int64
trans_key             float64
dtype: object

For this assignment, I have to do it in pyspark - however, pyspark just uses all the same syntax as vanilla python. I just tried your examples in Spark and got the same results.
I guess the only question is whether I have to set up the dataframe using that convoluted setup that I used before, but I guess I don't!

vikingstrike
Sep 23, 2007

whats happening, captain

Seventh Arrow posted:

Yes, I made a tiny little inconspicuous link in my post...here's the whole hog:

http://www.vaughn-s.net/hadoop/trans_fact_4.csv


For this assignment, I have to do it in pyspark - however, pyspark just uses all the same syntax as vanilla python. I just tried your examples in Spark and got the same results.
I guess the only question is whether I have to set up the dataframe using that convoluted setup that I used before, but I guess I don't!

If you need to create a DataFrame, and pyspark has a DataFrame creator function that gives you the desired output, I'm not sure why you'd try to roll your own. Turning CSVs into DataFrames is some of the most basic functionality of a library like this.
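
In recent Spark versions it's basically one line (assuming a SparkSession named spark; exact options may vary by version):

code:

df = spark.read.csv('trans_fact_4.csv', header=True, inferSchema=True)
df.printSchema()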

Seventh Arrow
Jan 26, 2005

vikingstrike posted:

If you need to create a DataFrame, and pyspark has a DataFrame creator function that gives you the desired output, I'm not sure why you'd try to roll your own. Turning CSVs into DataFrames is some of the most basic functionality of a library like this.

Well, I tried googling on my own to try to solve the problem myself and came across this page:

https://www.nodalpoint.com/spark-data-frames-from-csv-files-handling-headers-column-types/

It was fascinating for me, but you're right...it's a lot of busywork for something that can be done in a more simple manner.

This is actually kinda-sorta for a job interview. The teacher at the place where I study data engineering was privy to the "skill-testing exercises" that an employer uses and said that if I can solve them, he will try to get me a job interview. I feel bad that it's giving me so much trouble; I mean, data analysis isn't usually in an engineer's job description, but job descriptions in Big Data aren't an exact science right now. Anyways, I hope that I'm becoming a better programmer and thanks for the assist!

Cingulate
Oct 23, 2012

by Fluffdaddy
I'm trying to look up what a movie professional's primary occupation is.

Python code:
searchstr = "nconst == @person"
find = lambda person: df_movies.query(searchstr)["category"].mode()[0]
df_actors["main"] = df_actors["nconst"].map(find)
("nconst" is an identifier, "category" is the job on that movie)

It's not too slow, but could it go faster?

Edit: obvious suggestion would be kicking out all the porn movies.

edit 2: Oh wow, I switched to df.groupby and now it's much faster.
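
For posterity, the groupby version is roughly this (from memory, untested, same column names as above):

Python code:

# One pass over df_movies instead of one .query() per person
main_job = df_movies.groupby('nconst')['category'].agg(lambda s: s.mode()[0])
df_actors['main'] = df_actors['nconst'].map(main_job)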

Cingulate fucked around with this message at 20:37 on Feb 28, 2018

PoizenJam
Dec 2, 2006

Damn!!!
It's PoizenJam!!!
In list logic, .remove will remove the first item in a given list that matches the query...

How do you delete the last item in a list that matches a particular query?

Because the best I can come up with is to reverse the list, apply the remove function, then reverse the list again. And that strikes me as terribly inefficient :v:

huhu
Feb 24, 2006

JVNO posted:

In list logic, .remove will remove the first item in a given list that matches the query...

How do you delete the last item in a list that matches a particular query?

Because the best I can come up with is to reverse the list, apply the remove function, then reverse the list again. And that strikes me as terribly inefficient :v:

You could iterate in reverse.

Wallet
Jun 19, 2006

JVNO posted:

In list logic, .remove will remove the first item in a given list that matches the query...

How do you delete the last item in a list that matches a particular query?

Because the best I can come up with is to reverse the list, apply the remove function, then reverse the list again. And that strikes me as terribly inefficient :v:

You can use del to remove an item from a list by index (del list[0]), which might help, depending on what you're actually trying to do.

Eela6
May 25, 2007
Shredded Hen

JVNO posted:

In list logic, .remove will remove the first item in a given list that matches the query...

How do you delete the last item in a list that matches a particular query?

Because the best I can come up with is to reverse the list, apply the remove function, then reverse the list again. And that strikes me as terribly inefficient :v:

I would do it like this:

Use reversed() to iterate over the list in reverse (i.e., from back to front) without modifying the list.

Find the index to remove, then use the del statement to remove that element of the list.

Note that removing elements from the middle of a list is not a particularly efficient operation.

Putting it together:

Python code:
from typing import *
def delete_last_matching_inplace(a: List[int], k: int):
    for i, n in enumerate(reversed(a), 1):
        if n == k:
            del a[-i]
            return
    raise ValueError(f'no element in {a} matches {k}')
pre:
In [12]: a = [1, 2, 3, 2, 5, 2]

In [13]: delete_last_matching_inplace(a, 2)

In [14]: a
Out[14]: [1, 2, 3, 2, 5]

In [15]: delete_last_matching_inplace(a, 2)

In [16]: a
Out[16]: [1, 2, 3, 5]

In [17]: delete_last_matching_inplace(a, 2)

In [18]: a
Out[18]: [1, 3, 5]

In [19]: delete_last_matching_inplace(a, 2)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-dc87b4f3a28d> in <module>()
----> 1 delete_last_matching_inplace(a, 2)

<ipython-input-3-b7155902c6b2> in delete_last_matching_inplace(a, k)
      4             del a[-i]
      5             return
----> 6     raise ValueError(f'no element in {a} matches {k}')

ValueError: no element in [1, 3, 5] matches 2

baka kaba
Jul 19, 2003

PLEASE ASK ME, THE SELF-PROFESSED NO #1 PAUL CATTERMOLE FAN IN THE SOMETHING AWFUL S-CLUB 7 MEGATHREAD, TO NAME A SINGLE SONG BY HIS EXCELLENT NU-METAL SIDE PROJECT, SKUA, AND IF I CAN'T PLEASE TELL ME TO
EAT SHIT

Probably worth benchmarking it (Python's lists are apparently arrays, so a reversed iterator should be fast?) but you could always just iterate over the array normally, assign the index to a variable whenever you find a match, and then at the end it'll be set to the index of the last matching element (or None)

Array traversal should be fast either way, not sure about Python's implementation - reverse iteration is neater though (since it stops as early as possible)
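
Something like this, as a sketch of the forward-scan version (same behaviour as the reversed one, just not benchmarked):

Python code:

def delete_last_matching_forward(a, k):
    # Walk the whole list once, remembering the index of the most recent match
    last = None
    for i, n in enumerate(a):
        if n == k:
            last = i
    if last is None:
        raise ValueError(f'no element in {a} matches {k}')
    del a[last]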

PoizenJam
Dec 2, 2006

Damn!!!
It's PoizenJam!!!
Wow, great responses and super quick. Unfortunately the responses aren't easily applied to my own program, and I decided instead to rebuild the program in a way that obviated the need for removal.

For anyone curious, I needed a list generated that includes 20 of each of the following:

NR
L0
L2P
L2T
L4P
L4T
L8P
L8T

For a total of 160 items in the list. All of these stand for different experimental conditions, and are randomly presented, but some conditions are related. The rules are:

L0 and NR can go anywhere in the list that doesn’t conflict with another rule.
For n = L2P, n + 1 = L2T
For n = L4P, n + 2 = L4T
For n = L8P, n + 4 = L8T

I’m phone posting now, but my new approach is to add the L(X)P items to the list at the start, shuffle the order, and use that as a seed for a pseudo-random procedural generator. The procedural generator will then populate the list with L(X)T items, using L0/NR items as filler when necessary.

It’s a heck of a lot more complicated than I thought ought to be necessary (~150 lines of code), and is slower than my usual experiment list generator, but I’m ironing out a final couple bugs (usually missing L(X)P items) and it appears to work.
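
For anyone who wants to check a generated list against those pairing rules, here's a quick sketch of a checker (the generation itself is the hard part, this just validates the output):

Python code:

def satisfies_rules(seq):
    # L2P must be followed by L2T one slot later, L4P by L4T two later, L8P by L8T four later
    pairs = {'L2P': ('L2T', 1), 'L4P': ('L4T', 2), 'L8P': ('L8T', 4)}
    for i, item in enumerate(seq):
        if item in pairs:
            target, gap = pairs[item]
            if i + gap >= len(seq) or seq[i + gap] != target:
                return False
    return True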


Da Mott Man
Aug 3, 2012


Wallet posted:

You can use del to remove an item from a list by index (del list[0]), which might help, depending on what you're actually trying to do.

I like this method for pulling off the bottom of the list.

code:
somelist.pop(-1)
EDIT: replied to the wrong person

Da Mott Man fucked around with this message at 22:50 on Mar 2, 2018
