huhu
Feb 24, 2006
I decided to write a Flask intro for anyone that is curious.

Why Flask?
In my opinion, Flask feels like just another package you'd pull into a project. It's simple to use and very modular: you only add new functionality, such as login support, if you actually need it. It also does a great job of letting you write almost solely Python. On the backend, the database can be created without having to write SQL statements. On the frontend, forms and other page content can be generated with Python instead of JavaScript. If you would prefer not to write CSS/HTML, you can find pre-made static templates that are easy to integrate. Finally, it's nice not to have to deal with PHP.

How to get started
Both the book and tutorial are resources by the same author and they are awesome. I would recommend reading through both.
Flask Book
Flask Tutorial
Flask Documentation

What can it be used for?
I’ve used it to build simple static websites and as a way to serve content, specifically images from a security camera, from my server. Additionally, I followed a tutorial for building a blog. For my new job, I’ll be building dashboards for scientists to access and analyze their data.

Basic Demo
Setup
The basic setup steps are to install a virtual environment, activate it, and then install Flask. (The activate line below is for Windows; on mac/linux it's source venv_sa/bin/activate.)
code:
virtualenv venv_sa
venv_sa\Scripts\activate.bat
pip install flask
Directory Structure
The directory structure consists of the folder for the virtual environment, hello.py (defined below), and a templates directory. Templates are written in HTML; data from databases or other sources can be added to them before they are returned to the end user as a webpage.
code:
hello.py
templates/
    user_page.html
venv_sa/
hello.py
The code below starts by declaring a new instance of Flask named app. To the app we add routing, which is how we decide what information to send back to the user. The first route, '/', which points to index(), returns "Hello World" when a user visits https://www.website.com/. The second route is a little special: if someone visits https://www.website.com/user/huhu, the string 'huhu' is passed to the user_page() function. user_page() also grabs the current time. These two bits of data are then rendered into the 'user_page.html' template by the render_template() function.
code:
import datetime
from flask import Flask, render_template

app = Flask(__name__)

# Routing
@app.route('/')
def index():
    return "Hello World."

@app.route('/user/<username>')
def user_page(username):
    current_time = datetime.datetime.now()
    return render_template('user_page.html', username=username, current_time=current_time)

if __name__ == '__main__':
    app.run(debug=True)
user_page.html
This template takes username and current_time from the user_page() function above, adds the data to the template, and then returns it to the user.
code:
Hello {{ username }},<br>
The current time is {{ current_time }}.
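Visiting /user/huhu would then render something like "Hello huhu, The current time is 2017-03-31 12:09:00" (the exact format comes from datetime's default string conversion; timestamp made up here).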
Launch the app
The last step is to simply run the following command and the app will start:
code:
python hello.py
The website can then be viewed at http://127.0.0.1:5000.

Dominoes
Sep 20, 2007

Linked to huhu's post in OP.

KingNastidon
Jun 25, 2004
Just getting started with pandas, so this is probably a really naive question. I'm trying to replicate a really basic algorithm from VBA for oncology forecasting. There is a data file with new patient starts and anticipated duration of therapy in monthly cohorts. Each new patient cohort loses a certain number of patients each time period, using an exponential function to mimic the shape of a Kaplan-Meier progression curve. The sum of a column represents the total remaining patients on therapy across all cohorts. Below is the code I'm using:

code:
for x in range(0, nummonths):
    for y in range(x, nummonths):
        cohorts.iloc[x,y] = data.ix[x,'nps'] * math.pow(data.ix[x,'duration'], y-x)
Calculating this n by n array with n>500 is instantaneous in VBA, but it takes 10+ seconds in pandas with Jupyter. Any thoughts on what makes this so slow in pandas, or a better way to go about this?

KingNastidon fucked around with this message at 06:15 on Mar 31, 2017

FAT32 SHAMER
Aug 16, 2012



Ok here's a dumb question: I want to make a unicornhat (which uses a Python library) on a raspberry pi do something when a new order is placed. According to the Shopify API, you would use their webhooks to do it: you supply them with a URL and they deliver an HTTP POST payload to it in JSON format.

I understand the concept of json stuff after working with the discord API but I've never worked with webhooks before, so correct me if I'm wrong but does that mean I have to have my own server running and listening for the HTTP POST payload? Since it's a raspberry pi would it be easier to try to sniff for like an email notification to trigger the light_sign action?

Any pointers would be great, I'm doing this for a friend's company in exchange for some of the neat t-shirts that he sells so I'd really like to find a way to make it work that doesn't take me a million years to figure out

Foxfire_
Nov 8, 2010

KingNastidon posted:

Just getting started with pandas, so this is probably a really naive question. I'm trying to replicate a really basic algorithm from VBA for oncology forecasting. There is a data file with new patient starts and anticipated duration of therapy in monthly cohorts. Each new patient cohort loses a certain number of patients each time period, using an exponential function to mimic the shape of a Kaplan-Meier progression curve. The sum of a column represents the total remaining patients on therapy across all cohorts. Below is the code I'm using:

code:
for x in range(0, nummonths):
    for y in range(x, nummonths):
        cohorts.iloc[x,y] = data.ix[x,'nps'] * math.pow(data.ix[x,'duration'], y-x)
Calculating this n by n array with n>500 is instantaneous in VBA, but it takes 10+ seconds in pandas with Jupyter. Any thoughts on what makes this so slow in pandas, or a better way to go about this?

Python is extremely slow and extremely RAM hungry. This is a consequence of aspects of its type system and design. The interpreter has no way to know what to do for any operation without going through a dance of type lookups. It also can't prove that those operations are the same from loop to loop, so it has to redo all that work for every element.

The trick to get acceptable performance is to arrange it so that none of the actual computation is done in python (so it is fast) and none of the data is saved as python objects (so it is RAM efficient). You essentially have a bunch of python glue code that is telling external libraries what to do in broad strokes, with the actual computation done by not-python.

The main library you use to do this is numpy. Pandas uses numpy arrays as the storage for its DataFrame and Series values.

A line like:

code:
foo = numpy.zeros(100, dtype=numpy.int32)
is saying "Numpy, allocate an array of 100 int32's and return a python object that represents that array". The actual array is allocated from the C runtime's heap and isn't made up of python objects.

Something like:

code:
foo *= 2
is saying "Numpy, go find the array represented by foo and multiply all its values by 2". No python code runs to do the multiplications.

A piece of code like this:
code:
for i in range(len(foo)):
  foo[i] *= 2
is saying, "In python, have i iterate from 0 to the length of foo. For each index, ask numpy to construct a python object that has the same value as the array at that index. Look up its type, figure out what *=2 means for it (maybe it's integer math, maybe it's float math, maybe it's string concatenation, etc...), do that operation on the python object, then ask numpy to change the value of the array at that index to have the same value as the python object."

Basically any time you loop over your data, the performance will be crap.

Your options are:
1. Rearrange your code so that it's asking numpy to do bulk operations instead of doing them in python. This can be hard/impossible to do and isn't usually how people think about their problems.
2. Write the part of your algorithm that's hard to express as bulk operations in C, so that it becomes a new bulk operation you can call. Alternately, dig through libraries to see if someone else has already done this for the algorithm you want.
3. Use numba. numba is a python-like sublanguage where enough python features are removed and types are locked down that it can be efficiently compiled. Depending on your algorithm, rewriting it in numba could require zero work or could be more annoying than redoing it in C. (The example you posted would be on the zero-work end; see the sketch below.)
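
For reference, here's roughly what the numba version of your loop would look like (untested sketch; it assumes you pull the columns out as plain numpy arrays first, rather than assigning through .iloc):

Python code:
import numba
import numpy as np

@numba.njit
def fill_cohorts(nps, duration, nummonths):
    # Same double loop as the original, but compiled to machine code by numba.
    cohorts = np.zeros((nummonths, nummonths))
    for x in range(nummonths):
        for y in range(x, nummonths):
            cohorts[x, y] = nps[x] * duration[x] ** (y - x)
    return cohorts

# e.g.: fill_cohorts(data['nps'].values, data['duration'].values, nummonths)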

QuarkJets
Sep 8, 2008

KingNastidon posted:

Just getting started with pandas, so this is probably a really naive question. I'm trying to replicate a really basic algorithm from VBA for oncology forecasting. There is a data file with new patient starts and anticipated duration of therapy in monthly cohorts. Each new patient cohort loses a certain number of patients each time period, using an exponential function to mimic the shape of a Kaplan-Meier progression curve. The sum of a column represents the total remaining patients on therapy across all cohorts. Below is the code I'm using:

code:
for x in range(0, nummonths):
    for y in range(x, nummonths):
        cohorts.iloc[x,y] = data.ix[x,'nps'] * math.pow(data.ix[x,'duration'], y-x)
Calculating this n by n array with n>500 is instantaneous in VBA, but it takes 10+ seconds in pandas with Jupyter. Any thoughts on what makes this so slow in pandas, or a better way to go about this?

Basically I was going to say what Foxfire_ said; for loops in Python are slow and usually aren't what you want to use for bulk-computation. VBA is a compiled language, but Python is not.

The first-order speedup you can apply to code like this is to use vectorized numpy operations instead of for loops.
code:
import numpy as np
for x in range(0, nummonths):
    cohorts.iloc[x, x:nummonths] = data.ix[x, 'nps'] * np.power(data.ix[x, 'duration'], np.array(range(nummonths-x)))
That's one loop removed and should be significantly faster. It's possible to get rid of the outer loop as well using something like meshgrid to take care of the indexing but it's too late and I'm too tired to type it all out

Malcolm XML
Aug 8, 2009

I always knew it would end like this.

QuarkJets posted:

Basically I was going to say what Foxfire_ said; for loops in Python are slow and usually aren't what you want to use for bulk-computation. VBA is a compiled language, but Python is not.

The first-order speedup you can apply to code like this is to use vectorized numpy operations instead of for loops.
code:

import numpy as np
for x in range(0, nummonths):
    cohorts.iloc[x, x:nummonths] = data.ix[x, 'nps'] * np.power(data.ix[x, 'duration'], np.array(range(nummonths-x)))

That's one loop removed and should be significantly faster. It's possible to get rid of the outer loop as well using something like meshgrid to take care of the indexing but it's too late and I'm too tired to type it all out

Pandas indexing is insanely slow compared to the underlying arrays

Also vectorize your poo poo

Or don't, and use numba; it's a tossup as to which ends up faster

Malcolm XML
Aug 8, 2009

I always knew it would end like this.
This has little to do with python being compiled or not: pandas has a ton of overhead if you don't do things the correct way

Dominoes
Sep 20, 2007

I've been using wrappers like this when I'm working with dataframes and speed is a limitation. It lets you use Pandas' descriptive row and column indices instead of integers, while maintaining the speed of working with arrays. Pandas DFs can be orders of magnitude slower than arrays.

Python code:
class FastDf:
    def __init__(self, df: pd.DataFrame, convert_to_date: bool=False):
        """Used to improve DataFrame index speed by converting to an array. """
        if convert_to_date:
            # Convert from PD timestamps to datetime.date.
            self.row_indexes = {row.date(): i for i, row in enumerate(df.index)}
        else:
            self.row_indexes = {row: i for i, row in enumerate(df.index)}
        self.column_indexes = {col: i for i, col in enumerate(df.columns)}
        self.array = df.values

    def loc(self, row, col):
        return self.array[self.row_indexes[row], self.column_indexes[col]]

    def loc_range(self, row_start, row_end):
        return self.array[self.row_indexes[row_start]: self.row_indexes[row_end]]
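
Usage is just two dict lookups plus raw array indexing (toy frame made up for illustration):

Python code:
import pandas as pd

df = pd.DataFrame({'open': [1.0, 2.0], 'close': [1.5, 2.5]}, index=['AAPL', 'GOOG'])
fast = FastDf(df)
print(fast.loc('AAPL', 'close'))  # 1.5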

Dominoes fucked around with this message at 12:09 on Mar 31, 2017

SurgicalOntologist
Jun 17, 2004

KingNastidon posted:

Just getting started with pandas, so this is probably a really naive question. I'm trying to replicate a really basic algorithm from VBA for oncology forecasting. There is a data file with new patient starts and anticipated duration of therapy in monthly cohorts. Each new patient cohort loses a certain number of patients each time period, using an exponential function to mimic the shape of a Kaplan-Meier progression curve. The sum of a column represents the total remaining patients on therapy across all cohorts. Below is the code I'm using:

code:
for x in range(0, nummonths):
    for y in range(x, nummonths):
        cohorts.iloc[x,y] = data.ix[x,'nps'] * math.pow(data.ix[x,'duration'], y-x)
Calculating this n by n array with n>500 is instantaneous in VBA, but it takes 10+ seconds in pandas with Jupyter. Any thoughts on what makes this so slow in pandas, or a better way to go about this?

I don't fully understand the algorithm, but the first step is to try to do this with vector operations. This might not work as written and may need small tweaks, but at least it should illustrate the idea of vectorizing the operation.

Python code:
x, y = np.meshgrid(range(0, nummonths), range(0, nummonths))
cohorts = data['nps'] * data['duration']**(y - x)
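
If the broadcasting fights you, the same idea with plain numpy arrays looks like this (also untested; it masks out the lower triangle, and assumes the 'nps'/'duration' columns line up with the cohort rows):

Python code:
import numpy as np

nps = data['nps'].values
duration = data['duration'].values

# cohorts[i, j] = nps[i] * duration[i]**(j - i) for j >= i, else 0
x, y = np.meshgrid(np.arange(nummonths), np.arange(nummonths), indexing='ij')
exps = y - x
cohorts = np.where(exps >= 0,
                   nps[:, None] * duration[:, None] ** np.maximum(exps, 0),
                   0.0)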

QuarkJets
Sep 8, 2008

Malcolm XML posted:

This has little to do with python being compiled or not: pandas has a ton of overhead if you don't do things the correct way

Well there you go, I didn't know that because I don't use pandas (like at all)

Space Kablooey
May 6, 2009


funny Star Wars parody posted:

Ok here's a dumb question: I want to make a unicornhat (which uses a Python library) on a raspberry pi do something when a new order is placed. According to the Shopify API, you would use their webhooks to do it: you supply them with a URL and they deliver an HTTP POST payload to it in JSON format.

I understand the concept of json stuff after working with the discord API but I've never worked with webhooks before, so correct me if I'm wrong but does that mean I have to have my own server running and listening for the HTTP POST payload? Since it's a raspberry pi would it be easier to try to sniff for like an email notification to trigger the light_sign action?

Any pointers would be great, I'm doing this for a friend's company in exchange for some of the neat t-shirts that he sells so I'd really like to find a way to make it work that doesn't take me a million years to figure out

Yes, that is correct. Your URL should point to a live server that is expecting a POST with the format that Shopify sends you. I don't know any specifics of the Raspberry Pi, so I can't really help you with your second question.

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

funny Star Wars parody posted:

Ok here's a dumb question: I want to make a unicornhat (which uses a Python library) on a raspberry pi do something when a new order is placed. According to the Shopify API, you would use their webhooks to do it: you supply them with a URL and they deliver an HTTP POST payload to it in JSON format.

I understand the concept of json stuff after working with the discord API but I've never worked with webhooks before, so correct me if I'm wrong but does that mean I have to have my own server running and listening for the HTTP POST payload? Since it's a raspberry pi would it be easier to try to sniff for like an email notification to trigger the light_sign action?

Any pointers would be great, I'm doing this for a friend's company in exchange for some of the neat t-shirts that he sells so I'd really like to find a way to make it work that doesn't take me a million years to figure out

Yeah, you need a server running. Basically Shopify will call a URL on a server you control when orders are placed. Then your code does something when that happens. It will work fine on a raspberry pi.

Flask and Django are both capable of doing this in like 20 lines of code or something. The hardest part will be learning how to do it, because for some reason understanding how the web works is not easy despite it being pretty simple.

Tigren
Oct 3, 2003

Thermopyle posted:

Yeah, you need a server running. Basically Shopify will call a URL on a server you control when orders are placed. Then your code does something when that happens. It will work fine on a raspberry pi.

Flask and Django are both capable of doing this in like 20 lines of code or something. The hardest part will be learning how to do it, because for some reason understanding how the web works is not easy despite it being pretty simple.

This is the perfect use case for Flask. You don't need the bells and whistles of Django.

Python code:
from flask import Flask, request

from raspberry_pi_funcs import blink_bright_light  # I don't know how the raspberry pi functions work

app = Flask(__name__)

@app.route('/', methods=['POST'])
def index():
    data = request.get_json()
    if data['total_line_items_price'] > 100:
        blink_bright_light()
    return "Success"

if __name__ == '__main__':
    app.run(host='0.0.0.0')  # listen on the network so Shopify's POST can reach the Pi

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Tigren posted:

This is the perfect use case for Flask. You don't need the bells and whistles of Django.

Python code:
from flask import Flask, request

from raspberry_pi_funcs import blink_bright_light  # I don't know how the raspberry pi functions work

app = Flask(__name__)

@app.route('/', methods=['POST'])
def index():
    data = request.get_json()
    if data['total_line_items_price'] > 100:
        blink_bright_light()
    return "Success"

if __name__ == '__main__':
    app.run(host='0.0.0.0')  # listen on the network so Shopify's POST can reach the Pi

FWIW, it's about the same in Django. Flask is fine and there's a 50/50 chance I'd use it for the project myself, but occasionally I like to point out that people have this idea that Django is really complicated or something. It includes a lot more "in the box" than Flask, but it'd work basically just like Flask in this case. Single file, basically same # lines of code, etc.
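
For the curious, a single-file Django version of Tigren's webhook would look something like this (untested sketch, old-style urls; raspberry_pi_funcs is the same made-up module as in the Flask version):

Python code:
import json
import sys

from django.conf import settings
from django.conf.urls import url
from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt

from raspberry_pi_funcs import blink_bright_light

settings.configure(
    DEBUG=True,
    SECRET_KEY='not-a-real-secret',
    ROOT_URLCONF=__name__,
    ALLOWED_HOSTS=['*'],
)

@csrf_exempt  # Shopify can't send a CSRF token, so exempt the endpoint
def index(request):
    data = json.loads(request.body)
    if data['total_line_items_price'] > 100:
        blink_bright_light()
    return HttpResponse("Success")

urlpatterns = [url(r'^$', index)]

if __name__ == '__main__':
    from django.core.management import execute_from_command_line
    execute_from_command_line(sys.argv)  # run with: python thisfile.py runserver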

FAT32 SHAMER
Aug 16, 2012



I got promoted to developer recently and my first project has been building a file directory browser in php and Ajax and HTML and it's a fuckton to take in in two weeks so ya idk what it is with web stuff that is so hard to learn (probably all the acronyms and different languages/frameworks interfacing) but ya appreciate it! Flask looks like the way to go 😊 thanks!

Dex
May 26, 2006

Quintuple x!!!

Would not escrow again.

VERY MISLEADING!

condolences

FAT32 SHAMER
Aug 16, 2012



Dex posted:

condolences

week 2: i genuinely lust for death

xpander
Sep 2, 2004

funny Star Wars parody posted:

Ok here's a dumb question: I want to make a unicornhat (which uses a Python library) on a raspberry pi do something when a new order is placed. According to the Shopify API, you would use their webhooks to do it: you supply them with a URL and they deliver an HTTP POST payload to it in JSON format.

I understand the concept of json stuff after working with the discord API but I've never worked with webhooks before, so correct me if I'm wrong but does that mean I have to have my own server running and listening for the HTTP POST payload? Since it's a raspberry pi would it be easier to try to sniff for like an email notification to trigger the light_sign action?

Any pointers would be great, I'm doing this for a friend's company in exchange for some of the neat t-shirts that he sells so I'd really like to find a way to make it work that doesn't take me a million years to figure out

Echoing the other replies that Flask/Django are perfect for this. On the operational side, check out Zappa for some sweet serverless action. This will let you use some AWS services(API Gateway, Lambda) to run the application code without having to keep a server online. You'll fall well within the free tier for development usage, and probably well into their production load as well.
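
Getting going is basically pip install zappa, zappa init, zappa deploy dev; the init step writes a zappa_settings.json along these lines (bucket name and region here are placeholders):

code:
{
    "dev": {
        "app_function": "hello.app",
        "aws_region": "us-west-2",
        "s3_bucket": "my-zappa-deployments"
    }
}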

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Thanks for reminding me about Zappa. I keep meaning to try it out on my next project and then forget.

edit: oh yeah now I remember why I really haven't tried it. No Python 3 support. :(

Hughmoris
Apr 21, 2007
Let's go to the abyss!
*Disregard, found answer in OP.

Hughmoris fucked around with this message at 00:44 on Apr 1, 2017

xpander
Sep 2, 2004

Thermopyle posted:

Thanks for reminding me about Zappa. I keep meaning to try it out on my next project and then forget.

edit: oh yeah now I remember why I really haven't tried it. No Python 3 support. :(

Well, the real problem is that Lambda doesn't have Python 3 support. It's pretty annoying, but for some projects, might work fine.

FAT32 SHAMER
Aug 16, 2012



I use Python 2.7 anyways so yayyy

Edit: how much does Zappa cost per month for a server for a single webhook that is triggered maybe 5-10 times per day?

FAT32 SHAMER fucked around with this message at 00:57 on Apr 1, 2017

Space Kablooey
May 6, 2009


funny Star Wars parody posted:

I use Python 2.7 anyways so yayyy

Edit: how much does Zappa cost per month for a server for a single webhook that is triggered maybe 5-10 times per day?

The pricing depends on Amazon Lambda's pricing, but you will probably be well within the free tier.
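
Back of the envelope: 5-10 hits a day is roughly 300 invocations a month, and the Lambda free tier is 1 million requests plus 400,000 GB-seconds of compute per month, so a webhook at that volume rounds to $0. (API Gateway has its own free tier for the first year, if I remember right.)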

Dominoes
Sep 20, 2007

funny Star Wars parody posted:

I got promoted to developer recently and my first project has been building a file directory browser in php and Ajax and HTML and it's a fuckton to take in in two weeks so ya idk what it is with web stuff that is so hard to learn (probably all the acronyms and different languages/frameworks interfacing) but ya appreciate it! Flask looks like the way to go 😊 thanks!
Yo dawg. Web dev is hard because there are a lot of independent skills you need to learn: databases, backend frameworks, HTML, CSS, JS, AJAX, migrations, and whatever extra frameworks you're adding. You'll probably have to find independent tutorials for each. It's a pain at first, but keep at it!

Re Flask vs Django: Django for websites, Flask for other things you need a webserver for. Flask becomes a pain once you start adding plugins for things like admin, auth, migrations, databases etc.

Dominoes fucked around with this message at 02:35 on Apr 1, 2017

Space Kablooey
May 6, 2009


Dominoes posted:

Re Flask vs Django: Django for websites, Flask for other things you need a webserver for. Flask becomes a pain once you start adding plugins for things like admin, auth, migrations, databases etc.

Outside of a very specific library (Flask-Social), I had no issues with any Flask extension so far.

I'll give you that it's easier than Django to make a mess of packages and modules and dependencies when your project grows.

Space Kablooey fucked around with this message at 03:25 on Apr 1, 2017

Foxfire_
Nov 8, 2010

Malcolm XML posted:

This has little to do with python being compiled or not: pandas has a ton of overhead if you don't do things the correct way

Huh, I was expecting the pandas overhead to be much smaller. Some testing shows that it goes:

vectorized numpy > vectorized pandas >> iterative numpy >>> iterative pandas

An iterative solution in something with known types (C/Fortran/Java/numba/Julia/whatever) will still be faster for complicated calculations than the vectorized versions since the vectorized ones destroy all cache locality once the dataset is big enough (by the time you go back to do operation 2 on item 1, it's been booted out of the cache), but you can still get enough speedup to move some algorithms from unworkable to useable.

Python's speed problems mostly don't have to do with it being compiled in advance or not. They're consequences of how it works with types and the stuff it lets you change. It's basically impossible to JIT compile stuff like you would in Java because you can't prove that a loop isn't doing things like monkey patching functions or operators at some random loop iteration in the middle. The interpreter has to execute what you put down naively since it can't prove that it's safe to take any shortcuts.

Timing results:
code:
import numpy as np
import pandas as pd

numpy_array1 = np.random.rand(100000)
numpy_array2 = np.random.rand(100000)

print("Vectorized numpy")
%timeit out = numpy_array1 + numpy_array2

print("Iterative numpy")
out = np.empty(100000, dtype=np.float64)
%timeit for i in np.arange(100000): out[i] = numpy_array1[i] + numpy_array2[i]

pandas_dataframe = pd.DataFrame({'A': numpy_array1, 'B': numpy_array2})
print("Vectorized pandas")
%timeit out = pandas_dataframe.A + pandas_dataframe.B

print("Iterative pandas")
out = np.empty(100000, dtype=np.float64)
%timeit for i in np.arange(100000): out[i] = pandas_dataframe.A.iloc[i] + pandas_dataframe.B.iloc[i]
code:
Vectorized numpy
10000 loops, best of 3: 150 µs per loop
Iterative numpy
10 loops, best of 3: 52.1 ms per loop
Vectorized pandas
10000 loops, best of 3: 181 µs per loop
Iterative pandas
1 loop, best of 3: 4.3 s per loop

Tigren
Oct 3, 2003

xpander posted:

Echoing the other replies that Flask/Django are perfect for this. On the operational side, check out Zappa for some sweet serverless action. This will let you use some AWS services(API Gateway, Lambda) to run the application code without having to keep a server online. You'll fall well within the free tier for development usage, and probably well into their production load as well.

How do you trigger local raspberry pi functions with Lambda?

FAT32 SHAMER
Aug 16, 2012



Tigren posted:

How do you trigger local raspberry pi functions with Lambda?

It's a unicornhat Python library so I can just call the function that I already wrote to light the UH when the webhook gets a hit, I think

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Foxfire_ posted:

Huh, I was expecting the pandas overhead to be much smaller. Some testing shows that it goes:

vectorized numpy > vectorized pandas >> iterative numpy >>> iterative pandas

An iterative solution in something with known types (C/Fortran/Java/numba/Julia/whatever) will still be faster for complicated calculations than the vectorized versions since the vectorized ones destroy all cache locality once the dataset is big enough (by the time you go back to do operation 2 on item 1, it's been booted out of the cache), but you can still get enough speedup to move some algorithms from unworkable to useable.

Python's speed problems mostly don't have to do with it being compiled in advance or not. They're consequences of how it works with types and the stuff it lets you change. It's basically impossible to JIT compile stuff like you would in Java because you can't prove that a loop isn't doing things like monkey patching functions or operators at some random loop iteration in the middle. The interpreter has to execute what you put down naively since it can't prove that it's safe to take any shortcuts.

Timing results:
code:
import numpy as np
import pandas as pd

numpy_array1 = np.random.rand(100000)
numpy_array2 = np.random.rand(100000)

print("Vectorized numpy")
%timeit out = numpy_array1 + numpy_array2

print("Iterative numpy")
out = np.empty(100000, dtype=np.float64)
%timeit for i in np.arange(100000): out[i] = numpy_array1[i] + numpy_array2[i]

pandas_dataframe = pd.DataFrame({'A': numpy_array1, 'B': numpy_array2})
print("Vectorized pandas")
%timeit out = pandas_dataframe.A + pandas_dataframe.B

print("Iterative pandas")
out = np.empty(100000, dtype=np.float64)
%timeit for i in np.arange(100000): out[i] = pandas_dataframe.A.iloc[i] + pandas_dataframe.B.iloc[i]
code:
Vectorized numpy
10000 loops, best of 3: 150 µs per loop
Iterative numpy
10 loops, best of 3: 52.1 ms per loop
Vectorized pandas
10000 loops, best of 3: 181 µs per loop
Iterative pandas
1 loop, best of 3: 4.3 s per loop

Have you tried with Cython? (no clue if it works with pandas or numpy)

Malcolm XML
Aug 8, 2009

I always knew it would end like this.

Foxfire_ posted:

Huh, I was expecting the pandas overhead to be much smaller. Some testing shows that it goes:

vectorized numpy > vectorized pandas >> iterative numpy >>> iterative pandas

An iterative solution in something with known types (C/Fortran/Java/numba/Julia/whatever) will still be faster for complicated calculations than the vectorized versions since the vectorized ones destroy all cache locality once the dataset is big enough (by the time you go back to do operation 2 on item 1, it's been booted out of the cache), but you can still get enough speedup to move some algorithms from unworkable to useable.

Python's speed problems mostly don't have to do with it being compiled in advance or not. They're consequences of how it works with types and the stuff it lets you change. It's basically impossible to JIT compile stuff like you would in Java because you can't prove that a loop isn't doing things like monkey patching functions or operators at some random loop iteration in the middle. The interpreter has to execute what you put down naively since it can't prove that it's safe to take any shortcuts.

Timing results:
code:

import numpy as np
import pandas as pd

numpy_array1 = np.random.rand(100000)
numpy_array2 = np.random.rand(100000)

print("Vectorized numpy")
%timeit out = numpy_array1 + numpy_array2

print("Iterative numpy")
out = np.empty(100000, dtype=np.float64)
%timeit for i in np.arange(100000): out[i] = numpy_array1[i] + numpy_array2[i]

pandas_dataframe = pd.DataFrame({'A': numpy_array1, 'B': numpy_array2})
print("Vectorized pandas")
%timeit out = pandas_dataframe.A + pandas_dataframe.B

print("Iterative pandas")
out = np.empty(100000, dtype=np.float64)
%timeit for i in np.arange(100000): out[i] = pandas_dataframe.A.iloc[i] + pandas_dataframe.B.iloc[i]

code:

Vectorized numpy
10000 loops, best of 3: 150 µs per loop
Iterative numpy
10 loops, best of 3: 52.1 ms per loop
Vectorized pandas
10000 loops, best of 3: 181 µs per loop
Iterative pandas
1 loop, best of 3: 4.3 s per loop

It is entirely possible to JIT compile python using techniques similar to JavaScript -- hidden classes, partial evaluation, polymorphic inline caching -- but there's a lot more money behind JS dev.

PyPy does a lot of good things, but it's a research project first. Maybe Graal will help; JRuby augmented with Truffle and Graal is insanely fast comparatively.

oliveoil
Apr 22, 2016
Is there anything like the JVM spec but for Python?

I want to know what I can assume about the way Python works. I want to know what's the memory overhead for creating an object, for example. How many bits are in an integer? And so on. I don't want to cobble together answers from StackOverflow that maybe describe the behavior of a specific Python implementation that isn't necessarily fixed, for example.

Also, I'd like to learn separately (so that I know what is prescribed for all Python implementations, and what is specific to or just an implementation detail of the most popular implementation) about the internals of the most popular Python implementation. I'm guessing that's CPython, but I imagine 2.7 and 3.blah are pretty different. In general, though, I want to know stuff like: how does garbage collection work in CPython? What happens when I create a thread? How do locks work? Java has a bunch of explanation about "happens-before" relationships when writing multi-threaded programs - what's the Python equivalent?

oliveoil
Apr 22, 2016
Ugh. From the Python documentation, "Floating point numbers are usually implemented using double in C;": https://docs.python.org/2/library/stdtypes.html#numeric-types-int-float-long-complex

"Usually". So what's the behavior I can assume I always have? :(

Eela6
May 25, 2007
Shredded Hen

oliveoil posted:

Is there anything like the JVM spec but for Python?

I want to know what I can assume about the way Python works. I want to know what's the memory overhead for creating an object, for example. How many bits are in an integer? And so on. I don't want to cobble together answers from StackOverflow that maybe describe the behavior of a specific Python implementation that isn't necessarily fixed, for example.

Also, I'd like to learn separately (so that I know what is prescribed for all Python implementations, and what is specific to or just an implementation detail of the most popular implementation) about the internals of the most popular Python implementation. I'm guessing that's CPython, but I imagine 2.7 and 3.blah are pretty different. In general, though, I want to know stuff like: how does garbage collection work in CPython? What happens when I create a thread? How do locks work? Java has a bunch of explanation about "happens-before" relationships when writing multi-threaded programs - what's the Python equivalent?

CPython is the standard. The best place to start is probably the Python/C API Reference Manual.

Garbage collection is handled by reference counting. Objects whose live reference count hits 0 are immediately deallocated. CPython also contains an occasional cyclic garbage detector; running the garbage collector can be forced, but outside of that its behavior is not (meant to be) predictable.

Python is more about knowing how things behave than what they are. E.g., all integers are implemented as "long" integer objects of arbitrary size.

Python doubles behave as IEEE 754 64-bit doubles, full stop.
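
You can watch both mechanisms from the interpreter (CPython-specific, obviously):

Python code:
import gc
import sys

x = []
print(sys.getrefcount(x))  # 2: the name 'x' plus the temporary argument reference

# Reference counting alone can't free a cycle; the cyclic detector handles it.
a = []
a.append(a)          # a list that contains itself
del a                # refcount never hits 0, so the object lingers...
print(gc.collect())  # ...until a collection pass finds it (returns >= 1)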

If you want to know more about Python's data model and how things work behind the scenes, I recommend Fluent Python, as I do to everyone.

Malcolm XML
Aug 8, 2009

I always knew it would end like this.

oliveoil posted:

Is there anything like the JVM spec but for Python?

I want to know what I can assume about the way Python works. I want to know what's the memory overhead for creating an object, for example. How many bits are in an integer? And so on. I don't want to cobble together answers from StackOverflow that maybe describe the behavior of a specific Python implementation that isn't necessarily fixed, for example.

Also, I'd like to learn separately (so that I know what is prescribed for all Python implementations, and what is specific to or just an implementation detail of the most popular implementation) about the internals of the most popular Python implementation. I'm guessing that's CPython, but I imagine 2.7 and 3.blah are pretty different. In general, though, I want to know stuff like: how does garbage collection work in CPython? What happens when I create a thread? How do locks work? Java has a bunch of explanation about "happens-before" relationships when writing multi-threaded programs - what's the Python equivalent?

LMAO it's whatever Guido fever dreams up and shits out in CPython.

Hell, it took a heroic effort to get Ruby specified, and that had people who cared behind it.

Spelunk through the source. CPython uses a ref-counting mechanic, FWIW.

FingersMaloy
Dec 23, 2004

Fuck! That's Delicious.
I'm still trying to make this scraper work. I've abandoned BeautifulSoup and totally committed to Scrapy. My spider works, but I can't make it pull the exact pieces I need. I'm using this code as my guide, but it's not fully working:

https://github.com/jayfeng1/Craigslist-Pricing-Project/blob/master/craigslist/spiders/CraigSpyder.py

He explains the methodology here:

http://www.racketracer.com/2015/01/29/practical-scraping-using-scrapy/

Python code:
# -*- coding: utf-8 -*-
import scrapy
from craig.items import CraigItem

class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["https://cleveland.craigslist.org/"]
    start_urls = ['https://cleveland.craigslist.org/search/hhh?query=no+section+8&postedToday=1&availabilityMode=0']

    def parse(self, response):
        titles = response.xpath("//p")
        for titles in titles:
            item = CraigItem()
            item['title'] = titles.xpath("a/text()").extract()
            item['link'] = titles.xpath("a/@href").extract()
            items.append(item)
            follow = "https://cleveland.craigslist.org" + item['link']
            request = scrapy.Request(follow , callback=self.parse_item_page)
            request.meta = item
            yield request
        
    def parse_item_page(self, response):
        maplocation = response.xpath("//div[contains(@id,'map')]")
        latitude = maplocation.xpath('@data-latitude').extract()
        longitude = maplocation.xpath('@data-longitude').extract()
        if latitude:
            item['latitude'] = float(latitude)
        if longitude:
            item['longitude'] = float(longitude)
        return item
On line 18 (follow =), I get: TypeError cannot concatenate 'str' and 'list' objects.

I run this command to execute the program: scrapy crawl basic -o items.csv -t csv. If I remove the second method I can get a spreadsheet with titles and links, but I need the geotag.

Any ideas?

Jose Cuervo
Aug 25, 2004

FingersMaloy posted:

I'm still trying to make this scraper work. I've abandoned BeautifulSoup and totally committed to Scrapy. My spider works, but I can't make it pull the exact pieces I need. I'm using this code as my guide, but it's not fully working:

https://github.com/jayfeng1/Craigslist-Pricing-Project/blob/master/craigslist/spiders/CraigSpyder.py

He explains the methodology here:

http://www.racketracer.com/2015/01/29/practical-scraping-using-scrapy/

Python code:
# -*- coding: utf-8 -*-
import scrapy
from craig.items import CraigItem

class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["https://cleveland.craigslist.org/"]
    start_urls = ['https://cleveland.craigslist.org/search/hhh?query=no+section+8&postedToday=1&availabilityMode=0']

    def parse(self, response):
        titles = response.xpath("//p")
        for titles in titles:
            item = CraigItem()
            item['title'] = titles.xpath("a/text()").extract()
            item['link'] = titles.xpath("a/@href").extract()
            items.append(item)
            follow = "https://cleveland.craigslist.org" + item['link']
            request = scrapy.Request(follow , callback=self.parse_item_page)
            request.meta = item
            yield request
        
    def parse_item_page(self, response):
        maplocation = response.xpath("//div[contains(@id,'map')]")
        latitude = maplocation.xpath('@data-latitude').extract()
        longitude = maplocation.xpath('@data-longitude').extract()
        if latitude:
            item['latitude'] = float(latitude)
        if longitude:
            item['longitude'] = float(longitude)
        return item
On line 18 (follow =), I get: TypeError cannot concatenate 'str' and 'list' objects.

I run this command to execute the program: scrapy crawl basic -o items.csv -t csv. If I remove the second method I can get a spreadsheet with titles and links, but I need the geotag.

Any ideas?

The error message is telling you that you are trying to concatenate a string and a list - apparently item['link'] is a list. I think your error is your for statement ("for titleS in titles:"), where I think you actually want to say "for title in titleS:", and then change occurrences of titleS in the for loop to title.

FingersMaloy
Dec 23, 2004

Fuck! That's Delicious.
I changed the first titles to title and edited the loop but it's throwing the same error. :(

Space Kablooey
May 6, 2009


The error is pretty straightforward, item['link'] is a list of somethings (or it could be empty). You can try printing it out and seeing if whatever you want is in there, and then you can concatenate with your follow link.

baka kaba
Jul 19, 2003

PLEASE ASK ME, THE SELF-PROFESSED NO #1 PAUL CATTERMOLE FAN IN THE SOMETHING AWFUL S-CLUB 7 MEGATHREAD, TO NAME A SINGLE SONG BY HIS EXCELLENT NU-METAL SIDE PROJECT, SKUA, AND IF I CAN'T PLEASE TELL ME TO
EAT SHIT

From a quick look at the documentation extract() returns a list of things that match your selector, and extract_first() pulls out a single element

You're probably getting lists with a single item when you really just wanted the item on its own
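
So in the parse loop, something like this (untested) should hand you a plain string to concatenate:

Python code:
# using the renamed loop variable from above; extract_first() returns one string (or None)
item['title'] = title.xpath("a/text()").extract_first()
item['link'] = title.xpath("a/@href").extract_first()
follow = "https://cleveland.craigslist.org" + item['link']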
