Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

BigRedDot posted:

A few weeks ago my company put on PyData NYC, a conference dedicated to data analytics with python. Authors and contributors of numpy, scipy, pandas, pytables, ipython, and other projects all gave great talks to several hundred attendees. Today all the talks were made available on Vimeo, for anyone interested!

http://vimeo.com/channels/pydata/videos/sort:preset/format:detail
I know I thanked you already for posting this, but I just got around to watching the AppNexus talk and stumbled upon so many things immediately relevant to my current time-series work that I needed to thank you again for posting this (and your company for hosting the conference). So thanks again!


Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Suspicious Dish posted:

You can use Gio to do the copy directly, which will use the credentials from the keyring.

Python code:
from gi.repository import Gio

source = Gio.File.new_for_path("/home/thermopyle/garbage/virus.dll")
target = Gio.File.new_for_uri("smb:///ebay/C$/Windows/system32/kernel32.dll")

source.copy(target, Gio.FileCopyFlags.NONE, None, None)

Perfect. Thanks!

edit: not quite perfect. Your method signature is missing the second argument for a progress_callback. According to this documentation you can pass None for that argument. But when I pass None I get:

code:
TypeError: Argument 2 does not allow None as a value
edit2: Ok, I was wrong, but when I use your method signature it tells me it takes 6 arguments...

Thermopyle fucked around with this message at 21:57 on Nov 11, 2012

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
There are seven arguments in the C version of the function. One is the GError out pointer, which is transformed into a Python exception. The other is the progress_callback_data, which is used by most bindings internally to store a copy to the current scope. But what I forgot is that PyGObject exposes that argument, and stores the scope and the argument passed in. It exposes that argument for compatibility reasons, I think. You should always pass None for it.

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Suspicious Dish posted:

There are seven arguments in the C version of the function. One is the GError out pointer, which is transformed into a Python exception. The other is the progress_callback_data, which is used by most bindings internally to store a copy to the current scope. But what I forgot is that PyGObject exposes that argument, and stores the scope and the argument passed in. It exposes that argument for compatibility reasons, I think. You should always pass None for it.

Ok, gotcha.

What's the deal with this documentation, then? http://www.pygtk.org/docs/pygobject/class-giofile.html#method-giofile--copy

Is it just plain wrong, or is there more current documentation that I can't find?

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
That's for PyGTK, which is dead. It's certainly been confusing a lot of people, so I'll see if I can have someone who maintains the PyGTK website help people find more accurate documentation. The thing is that we don't have generated documentation for Python yet, but we're working on it, I swear.

I guess the best resource right now is the gir file in /usr/share/gir-1.0/Gio.gir. PyGObject reads from this to give you the bindings at runtime.

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Suspicious Dish posted:

That's for PyGTK, which is dead. It's certainly been confusing a lot of people, so I'll see if I can have someone who maintains the PyGTK website help people find more accurate documentation. The thing is that we don't have generated documentation for Python yet, but we're working on it, I swear.

I guess the best resource right now is the gir file in /usr/share/gir-1.0/Gio.gir. PyGObject reads from this to give you the bindings at runtime.

Ok, understood. Thanks for all the info, I got it working just like I needed.

Captain Capacitor
Jan 21, 2008

The code you say?

BigRedDot posted:

A few weeks ago my company put on PyData NYC, a conference dedicated to data analytics with python. Authors and contributors of numpy, scipy, pandas, pytables, ipython, and other projects all gave great talks to several hundred attendees. Today all the talks were made available on Vimeo, for anyone interested!

http://vimeo.com/channels/pydata/videos/sort:preset/format:detail

I've got to echo the thanks for this. I've been reading a book about interfacing Python with R as well as messing with time series data in MongoDB, so this has been a great series to watch.

Pudgygiant
Apr 8, 2004

Garnet and black? More like gold and blue or whatever the fuck colors these are
Stupid question that I'm too dumb to figure out on my own. I'm trying to pick up the basics of page scraping with Python, and Google has to screw everything up by having their stock price ref id be different on every stock. If I'm looking for this string:
<td class=price><span id=ref_12441984_l>
why doesn't this work?
re.search(b'class=price><span id=(.*?)>', content)

good jovi
Dec 11, 2000

'm pro-dickgirl, and I VOTE!

Pudgygiant posted:

Stupid question that I'm too dumb to figure out on my own. I'm trying to pick up the basics of page scraping with Python, and Google has to screw everything up by having their stock price ref id be different on every stock. If I'm looking for this string:
<td class=price><span id=ref_12441984_l>
why doesn't this work?
re.search(b'class=price><span id=(.*?)>', content)

This isn't exactly answering your question, but parsing html using regexes is just a really bad idea. Use something like scrapy, or a parsing library like pyquery or lxml.html.

(see here for why regexes are bad in this case)
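If you want to stay in the standard library, html.parser can pull that id out without regexes. A quick sketch (the markup string is just the example fragment from above, and SpanIdFinder is a made-up name):

```python
from html.parser import HTMLParser

class SpanIdFinder(HTMLParser):
    """Collect the id of every <span> inside a <td class=price>."""
    def __init__(self):
        super().__init__()
        self.in_price_td = False
        self.ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td" and attrs.get("class") == "price":
            self.in_price_td = True
        elif tag == "span" and self.in_price_td and "id" in attrs:
            self.ids.append(attrs["id"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_price_td = False

finder = SpanIdFinder()
finder.feed('<td class=price><span id=ref_12441984_l>670.43</span></td>')
print(finder.ids)  # ['ref_12441984_l']
```

Note that HTMLParser copes with the unquoted attribute values here, which is exactly the kind of markup variation that breaks hand-rolled regexes.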

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Sailor_Spoon posted:

(see here for why regexes are bad in this case)

I've read that so many times that I knew in my bones what you were linking to without even looking at the url, and I still clicked it and read it again because it still makes me laugh.

Pudgygiant
Apr 8, 2004

Garnet and black? More like gold and blue or whatever the fuck colors these are
I can't even put into words how dumb that page makes me feel. Probably because I'm too dumb to know the words.

good jovi
Dec 11, 2000

'm pro-dickgirl, and I VOTE!

Pudgygiant posted:

I can't even put into words how dumb that page makes me feel. Probably because I'm too dumb to know the words.

To address your actual issue (because as bad as regexes are for HTML, we've all done it), I get the following:

Python code:
In [3]: content='<td class=price><span id=ref_12441984_l>'

In [4]: m = re.search('class=price><span id=(.*?)>', content)

In [5]: m.groups()
Out[5]: ('ref_12441984_l',)
Is that not what you wanted?

Pudgygiant
Apr 8, 2004

Garnet and black? More like gold and blue or whatever the fuck colors these are
That's exactly what I want, and drat if it doesn't work fine doing it that way. Now I have to figure out why it doesn't when it's crawling the whole page :bang:

Thanks for looking at it dude.

Yay
Aug 4, 2007

Pudgygiant posted:

Now I have to figure out why it doesn't when it's crawling the whole page
Double-check that your regular expression is searching in multi-line mode rather than single-line. Or make sure your input string is on one line, I guess.
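Concretely, the flag that matters for a `.*` spanning lines is re.DOTALL: by default `.` won't match a newline. A sketch using the earlier example markup split across two lines:

```python
import re

# The same markup as before, but split across two lines:
content = '<td class=price>\n<span id=ref_12441984_l>'

# By default '.' stops at newlines, so this fails on multi-line input:
print(re.search('class=price>.*<span id=(.*?)>', content))  # None

# With re.DOTALL, '.' matches newlines too and the capture works:
m = re.search('class=price>.*<span id=(.*?)>', content, re.DOTALL)
print(m.group(1))  # ref_12441984_l
```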

BigRedDot
Mar 6, 2008

I'm glad you liked the videos of the talks. We are doing another PyData on the west coast in March during the PyCon Sprints, I'll post those when they are available, too!

NadaTooma
Aug 24, 2004

The good thing is that everyone around you has more critical failures in combat, the bad thing is - so do you!

Pudgygiant posted:

That's exactly what I want, and drat if it doesn't work fine doing it that way. Now I have to figure out why it doesn't when it's crawling the whole page :bang:

Thanks for looking at it dude.

Just a guess, but is it possible that, since the regex is using ".*", it's being too greedy? How about this:
Python code:
>>> import re
>>> content='<td class=price><span id=ref_12441984_l>'
>>> m = re.search('class=price><span id=([^>]*?)>', content)
>>> m.groups()
('ref_12441984_l',)
It's a minor tweak, using "[^>]" to match non-closing-bracket characters, rather than all possible characters.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
Wait guys, hold on, give me a few minutes to get my square peg.

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes.

geonetix
Mar 6, 2011


It has been stated before, but by god don't parse HTML with regex, ever. Use lxml and be done with it. Your regex is going to break at some point and there is simply no escaping that fact. In the simplest cases of the two earlier examples, what happens if the id - just to prove a point - is "test<script>alert('lol');</script>"? The regexes would've broken.

So. Don't use regex. Use lxml.

Murodese
Mar 6, 2007

Think you've got what it takes?
We're looking for fine Men & Women to help Protect the Australian Way of Life.

Become part of the Legend. Defence Jobs.

Sailor_Spoon posted:

This isn't exactly answering your question, but parsing html using regexes is just a really bad idea. Use something like scrapy, or a parsing library like pyquery or lxml.html.

(see here for why regexes are bad in this case)

I laugh so loving hard every single time I see that page, even though I know it's coming.

Hed
Mar 31, 2004

Fun Shoe
Being able to use Python like that in Pig is :swoon:tastic

Thern
Aug 12, 2006

Say Hello To My Little Friend
So I'm trying to get better at unit testing, and one thing that gives me a hard time, is dealing with external dependencies. For instance I want to test a piece of code that relies on using Popen, but I want to mock up Popen itself so that I'm only testing the function.

From what I can tell, patch is supposed to let me do that. I've been trying to get it to work, but it still seems to call the original function.

Class that I want to test:

Python code:
class MyClass():
    def MyMethod(self):
        something = "MyStuff"
        return Popen(something)
And here is my unit test:
Python code:
class TestMyClass(TestCase):
    @patch("subprocess.Popen")
    def test_MyMethod(self,popen):
        popen.return_value = 1
        test = MyClass()
        test.MyMethod()
        popen.assert_called_with_once()
This gives me the following error:
code:
WindowsError: [Error 2] The system cannot find the file specified
Which means that it is trying to use the original Popen, and not the mock Popen that I want it to. Any ideas of what I'm doing wrong?

Titan Coeus
Jul 30, 2007

check out my horn

Thern posted:

Testing woes

Looking at it, my guess is you should change @patch("subprocess.Popen") to @patch("popen"). I'm not familiar with this testing framework though so that could be entirely off base.

Civil Twilight
Apr 2, 2011

Either change MyClass to use subprocess.Popen, or change your @patch decorator to patch myclass.Popen (where 'myclass' is whatever namespace MyClass is in). Otherwise MyClass will still reference the 'real' Popen, and not the patched version. See http://www.voidspace.org.uk/python/mock/patch.html#where-to-patch
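As a runnable sketch of "patch where it's looked up" (the myclass module here is faked with types.ModuleType just so the example is self-contained):

```python
import types
from unittest import mock

# A stand-in for the poster's module: it does "from subprocess import Popen",
# so myclass.Popen is a *separate* reference from subprocess.Popen.
myclass = types.ModuleType("myclass")
exec(
    "from subprocess import Popen\n"
    "class MyClass:\n"
    "    def MyMethod(self):\n"
    "        return Popen('MyStuff')\n",
    myclass.__dict__,
)

# Patch the name where it is looked up -- myclass.Popen, not subprocess.Popen:
with mock.patch.object(myclass, "Popen", return_value=1) as popen:
    assert myclass.MyClass().MyMethod() == 1
    popen.assert_called_once_with("MyStuff")
```

Patching "subprocess.Popen" instead would replace the attribute on the subprocess module, which MyClass never consults again after its import ran.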

Thern
Aug 12, 2006

Say Hello To My Little Friend
Changing MyClass to use subprocess.Popen instead of Popen did the trick for me. I always feel bad when I get stuck on namespace issues.

Thanks a lot for your help guys.

FoiledAgain
May 6, 2007

I find myself doing this kind of thing a lot:

code:

inventory = getInventory()  # returns something like [Object1, Object2, Object3]
d = dict()
for item in inventory:
    d[item.name] = item

Basically, I have a list of objects passed to me (I don't always know anything about them or how many are in the list, but I do know that they all have a .name attribute) and I need to turn it into a dictionary where the keys are the name attribute and the values are the objects themselves. Is there some dictionary method I could use to make this a single line? (It's not that I need a one-liner, I'm just a low-intermediate programmer wondering about new ways to do things.)

Nippashish
Nov 2, 2005

Let me see you dance!

FoiledAgain posted:

Basically, I have a list of objects passed to me (I don't always know anything about them or how many are in the list, but I do know that they all have a .name attribute) and I need to turn it into a dictionary where the keys are the name attribute and the values are the objects themselves. Is there some dictionary method I could use to make this a single line? (It's not that I need a one-liner, I'm just a low-intermediate programmer wondering about new ways to do things.)

dict() will accept a sequence of (key, value) tuples so you can do something like d = dict((i.name, i) for i in getInventory()).
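In runnable form (Item and get_inventory are invented stand-ins for whatever getInventory() hands you):

```python
class Item:
    def __init__(self, name):
        self.name = name

def get_inventory():
    # Stand-in for the getInventory() from the post.
    return [Item("Object1"), Item("Object2"), Item("Object3")]

# dict() consumes any iterable of (key, value) pairs:
d = dict((i.name, i) for i in get_inventory())
print(sorted(d))          # ['Object1', 'Object2', 'Object3']
print(d["Object2"].name)  # Object2
```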

accipter
Sep 12, 2003

Nippashish posted:

dict() will accept a sequence of (key, value) tuples so you can do something like d = dict((i.name, i) for i in getInventory()).

This would work as well:
d = {i.name: i for i in getInventory()}

Analytic Engine
May 18, 2009

not the analytical engine
edit: Never mind, I basically duplicated Nippashish's answer.

Analytic Engine fucked around with this message at 03:19 on Nov 17, 2012

FoiledAgain
May 6, 2007

Nippashish posted:

dict() will accept a sequence of (key, value) tuples so you can do something like d = dict((i.name, i) for i in getInventory()).

Oh that's easy. Thanks! Thanks to accipter too. I tried dict.fromkeys([x.name for x in inventory], inventory) but that gave me the unexpected result that the entire inventory was the value for each .name key.
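For the record, that fromkeys() behavior is expected: it pairs every key with the same single value object. A small sketch (Item and the sample names are invented):

```python
class Item:
    def __init__(self, name):
        self.name = name

inventory = [Item("sword"), Item("shield")]

# dict.fromkeys() pairs *every* key with the same one value object:
shared = dict.fromkeys([x.name for x in inventory], inventory)
print(shared["sword"] is shared["shield"] is inventory)  # True

# A dict comprehension pairs each key with its own object:
by_name = {x.name: x for x in inventory}
print(by_name["sword"].name)  # sword
```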

Bunny Cuddlin
Dec 12, 2004

accipter posted:

This would work as well:
d = {i.name: i for i in getInventory()}

It's also faster (probably trivially)

FoiledAgain
May 6, 2007

Bunny Cuddlin posted:

It's also faster (probably trivially)

I only understood about 1/8 of that, but the take-away message seemed to be that you should use d={} instead of d=dict(). I'm slightly disappointed by this, because it's my habit to use dict() for new dictionaries. But that doesn't look like the best practice. Am I similarly wasting time using list(), int(), str()? That said, I'm happy to learn about dictionary comprehensions, which I somehow didn't know about.

raminasi
Jan 25, 2005

a last drink with no ice

FoiledAgain posted:

I only understood about 1/8 of that, but the take-away message seemed to be that you should use d={} instead of d=dict(). I'm slightly disappointed by this, because it's my habit to use dict() for new dictionaries. But that doesn't look like the best practice. Am I similarly wasting time using list(), int(), str()? That said, I'm happy to learn about dictionary comprehensions, which I somehow didn't know about.

If I read correctly, the literal syntax is 1.6 to 6 times faster, depending on how many entries are in the dict you're constructing. But the speedup is just in the dict construction. Unless you're creating a bunch of them in an inner loop it almost assuredly won't be a noticeable slowdown, and even then it might not be. In this case there's no real reason not to use the literal syntax (that I'm aware of), but as with all optimizations like this, if you're worrying about it before you profile you're wasting your time.

The Gripper
Sep 14, 2004
i am winner

FoiledAgain posted:

I only understood about 1/8 of that, but the take-away message seemed to be that you should use d={} instead of d=dict(). I'm slightly disappointed by this, because it's my habit to use dict() for new dictionaries. But that doesn't look like the best practice. Am I similarly wasting time using list(), int(), str()? That said, I'm happy to learn about dictionary comprehensions, which I somehow didn't know about.
It really doesn't make a huge amount of difference. {} is literally an empty dict, [] an empty list, "" an empty string and 0 technically an empty int.

I don't think many people use the parameterless dict(), list(), int(), str() constructors rather than just using the empty literal versions of them anyway.
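If you're curious, both spellings build equal objects; the literal just skips a global name lookup and a call. A quick sketch (the dis comments are approximate and vary by Python version):

```python
# Both spellings build equal objects; the literal just avoids looking up
# a global name and making a call at runtime.
assert {} == dict()
assert [] == list()
assert "" == str()
assert 0 == int()

# dis shows the difference in bytecode:
import dis
dis.dis("{}")      # roughly: BUILD_MAP, then return
dis.dis("dict()")  # roughly: LOAD_NAME (dict), CALL, then return
```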

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

FoiledAgain posted:

I only understood about 1/8 of that, but the take-away message seemed to be that you should use d={} instead of d=dict(). I'm slightly disappointed by this, because it's my habit to use dict() for new dictionaries. But that doesn't look like the best practice. Am I similarly wasting time using list(), int(), str()? That said, I'm happy to learn about dictionary comprehensions, which I somehow didn't know about.

If you ever get to the point where you're swapping {}s with dict()s to squeeze the last 1% out of your program, chances are you should be rewriting some parts in a C module or something.

Pudgygiant
Apr 8, 2004

Garnet and black? More like gold and blue or whatever the fuck colors these are
Ok, this is the ugly rear end amateur hour script I have after switching to Yahoo finance. So much easier to pull down a csv than it is to scrape.

code:
import csv
from urllib.request import urlopen

def list_import(filename):
    uservar = filename
    filename = open(filename + ".csv").read()
    filename = filename.replace('"', '')
    filename = filename.replace(', ', '')
    filename = filename.split('\n')
    listlen = len(filename) -1
    del filename[listlen]
    urlbase = 'http://finance.yahoo.com/d/quotes.csv?s='
    listlen = len(filename)
    listnum = 0
    filenamea = []
    while (listnum < listlen):  #getting quotes
        filenamea.append([filename[listnum]])
        filenamea[listnum].append(urlopen(urlbase + filename[listnum] + '&f=a2').read())
        filenamea[listnum][1] = filenamea[listnum][1].decode('UTF-8')
        filenamea[listnum][1] = str(filenamea[listnum][1])
        filenamea[listnum][1] = filenamea[listnum][1].replace('\\r\\n', '')
        filenamea[listnum][1] = filenamea[listnum][1].replace('\r\\n', '')
        filenamea[listnum][1] = filenamea[listnum][1].replace('N/A', '0')
        filenamea[listnum][1] = filenamea[listnum][1].replace(' ', '')
        filenamea[listnum][1] = filenamea[listnum][1].replace(',', '')
        filenamea[listnum][1] = filenamea[listnum][1].replace('.', '')
        filenamea[listnum][1] = int(filenamea[listnum][1])
        currprice = urlopen(urlbase + filename[listnum] + '&f=a').read()
        currprice = currprice.decode('UTF-8')
        currprice = currprice.replace('\r\n', '')
        currprice = currprice.replace(' ', '')
        currprice = currprice.replace(',', '')
        currprice = currprice.replace('.', '')
        currprice = currprice.replace('N/A', '0')
        currprice = float(currprice)
        filenamea[listnum].append(currprice)
        filenamea[listnum].append(filenamea[listnum][1]*filenamea[listnum][2])
        filenamea[listnum][3] = round(filenamea[listnum][3])
        listnum = listnum + 1
    filenameb = []
    listnum = 0
    while (listnum < listlen):
        if filenamea[listnum][1] > 1:
            filenameb.append([filenamea[listnum][0]])
            filenameb[listnum].append(filenamea[listnum][1])
        listnum = listnum + 1
    myfile = open(uservar + 'vol.csv', "w")
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    wr.writerow(filenameb)
    return filenameb
On a scale of 1-10, how terrible is it? Keeping in mind that I'm less than functionally retarded when it comes to programming.

OnceIWasAnOstrich
Jul 22, 2006

Pudgygiant posted:

Ok, this is the ugly rear end amateur hour script I have after switching to Yahoo finance. So much easier to pull down a csv than it is to scrape.

On a scale of 1-10, how terrible is it? Keeping in mind that I'm less than functionally retarded when it comes to programming.

I don't really know what this is doing. You appear to be loading a CSV as a normal text file and parsing it manually, then using csv to write out a CSV instead of using it for input as well. You are also doing a whole lot of string replacements on whatever file you are pulling from Yahoo Finance; are those CSV files too? You are doing a whole lot of things manually that Python can do for you.

Pudgygiant
Apr 8, 2004

Garnet and black? More like gold and blue or whatever the fuck colors these are
Yeah, it pulls down a couple different CSVs from Yahoo, cleans them up, and consolidates the information for every stock on NASDAQ. This is just determining market cap, because the option for market cap gives it in K, M, or B, so it was easier to do the math myself. I probably reinvented the wheel on some of it, but doing a bunch of string replaces seemed easier than learning a new API when I'm on the baby's first project phase of learning Python, and I'm leasing an 8-core Xeon server with 16GB of RAM for other unrelated things, so efficiency isn't exactly key.

Opinion Haver
Apr 9, 2007

It's not just efficiency, it's correctness and readability. The csv module is dead simple for input as well as output.
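For instance, with io.StringIO standing in for real files (the ticker rows are made-up sample data):

```python
import csv
import io

# Reading: csv.reader handles the quoting and splitting for you.
raw = io.StringIO('"GOOG","670.43"\n"AAPL","547.06"\n')
rows = list(csv.reader(raw))
print(rows)  # [['GOOG', '670.43'], ['AAPL', '547.06']]

# Writing: one writerow() per row, no manual quoting or joining.
out = io.StringIO()
writer = csv.writer(out, quoting=csv.QUOTE_ALL)
for symbol, price in rows:
    writer.writerow([symbol, price])
print(out.getvalue())
```

Same idea with real files: pass an open file object instead of the StringIO, and write one row per writerow() call rather than one writerow() for the whole list.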


AlexG
Jul 15, 2004
If you can't solve a problem with gaffer tape, it's probably insoluble anyway.

Pudgygiant posted:

On a scale of 1-10, how terrible is it? Keeping in mind that I'm less than functionally retarded when it comes to programming.

Here's some stuff that I noticed while trying to figure your script out.

  1. It is one giant function that is doing several different jobs. It is often more convenient to separate out the various tasks, which makes them easier to understand, test, and reuse. For example, you start off by getting a list of stock symbols from a file. That could be its own function. Then, it's easier to run your query against a stock list from a different source, or to use the same list to do something different. Equally, you are doing almost the same things to sanitize currprice and filenamea[listnum][1] - assuming these are meant to be the same steps in both cases, they could be extracted out into a single function which you can call as required. It is a bit unclean to have a function which returns a value (filenameb) but also writes that to a file as a side effect. And so on.
  2. The way you reuse the 'filename' variable is confusing. It starts off as a filename, then contains the contents of the file, and finally is a list of strings. Also, 'filenamea' has nothing to do with filenames as far as I can see, except that its entries are indexed by the elements in filename (which is no longer a filename at this point).
  3. In filenamea and filenameb you are basically reinventing a Python dictionary - you've got some keys (the stock symbols) and associated values. Python can take care of this sort of thing for you - there are a lot of language/library features to simplify working with data of this form.
  4. In this version of the code as presented, you don't seem to be using the 2 and 3 indices into the filenamea sublists - things get put there, but never taken out.
  5. You usually do not need to iterate over lists using a counter. Using indices like this can often lead you a bit astray in the way you think about the code. If you start off by writing "for stock in stocks:" then you will be naturally led in a better direction and the code will probably end up cleaner. Then, using comprehensions to build lists or dictionaries will come quite naturally.

This is actually not too terrible, in my opinion, and it would not take much to make it reasonably "natural" Python - the problems are mostly to do with not taking advantage of some stuff that's already in the language and library. Splitting the code into several functions would also help a lot.
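To sketch points 1, 3, and 5 (all the function names here are invented, and the fetch callable stands in for the urlopen call against Yahoo, which this example skips):

```python
def read_symbols(text):
    """Turn the raw symbol-file contents into a list of ticker strings."""
    return [line.strip().replace('"', '')
            for line in text.splitlines() if line.strip()]

def clean_number(raw):
    """Strip the junk Yahoo leaves in a quote field and return an int."""
    for junk in ('\r', '\n', ' ', ',', '.'):
        raw = raw.replace(junk, '')
    return int(raw.replace('N/A', '0') or 0)

def build_quotes(symbols, fetch):
    """Map each symbol to its cleaned quote; fetch is injected so it's testable."""
    return {symbol: clean_number(fetch(symbol)) for symbol in symbols}

# In the real script, fetch would urlopen the Yahoo quotes URL per symbol:
quotes = build_quotes(read_symbols('"GOOG"\n"AAPL"\n'),
                      fetch=lambda symbol: '1,234\r\n')
print(quotes)  # {'GOOG': 1234, 'AAPL': 1234}
```

Each piece does one job, the sanitizing happens in exactly one place, and the result is a dict keyed by symbol instead of parallel indexed lists.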
