  • Locked thread
QuarkJets
Sep 8, 2008

Thermopyle posted:

Of course, Google also has a real hard-on for Go, so maybe it's just the new and shiny factor?

Wasn't Go also created at Google?


Cingulate
Oct 23, 2012

by Fluffdaddy

Thermopyle posted:

I always use pip to install stuff in my conda environments
Why do you do that?
I use conda wherever I can (which is almost always), particularly cause 1. I can do conda upgrade --all, 2. it'll take care of dependencies while respecting my MKL numpy.

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Cingulate posted:

Why do you do that?
I use conda wherever I can (which is almost always), particularly cause 1. I can do conda upgrade --all, 2. it'll take care of dependencies while respecting my MKL numpy.

Because conda is usually (at least it used to be, I haven't checked in a while) behind on versions of packages I use. Lots of packages aren't on conda, so then I use pip anyway and it is (was?) a pain to maintain packages with both conda and pip. requirements.txt is widely used, so I can easily use upstream generated requirements.txt or downstream can use my requirements.txt.

And it's not very good practice to upgrade all your packages in one fell swoop; you should pin your packages at specific versions and only upgrade the ones you need to.

Basically, I've found no upside to using conda-provided packages and some downsides.

accipter
Sep 12, 2003

Thermopyle posted:

Because conda is usually (at least it used to be, I haven't checked in a while) behind on versions of packages I use. Lots of packages aren't on conda, so then I use pip anyway and it is (was?) a pain to maintain packages with both conda and pip. requirements.txt is widely used, so I can easily use upstream generated requirements.txt or downstream can use my requirements.txt.

And it's not very good practice to upgrade all your packages in one fell swoop; you should pin your packages at specific versions and only upgrade the ones you need to.

Basically, I've found no upside to using conda-provided packages and some downsides.

Do you work much with numpy/scipy? I found that these basically require using conda for package management on Windows.

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

accipter posted:

Do you work much with numpy/scipy? I found that these basically require using conda for package management on Windows.

Not at all, but I do all my work in Ubuntu virtual machines on my Windows host machine, so I guess I probably wouldn't have a problem anyway.

If for some reason I had to work on a Windows machine with numpy, I'd probably use conda to install numpy and still use pip for everything else.

Hughmoris
Apr 21, 2007
Let's go to the abyss!
I can't think of a simple way to accomplish the following:

I have a CSV file that varies in size (currently 25k rows), with each line being a pair of names. I want to count how many times each pair of names appears throughout the file, regardless of order: "James,Sarah" is equal to "Sarah,James" for counting purposes.

Can I just convert the string to some sort of numerical value, then count how many times each numerical value appears in the file?

Cingulate
Oct 23, 2012

by Fluffdaddy

Hughmoris posted:

I can't think of a simple way to accomplish the following:

I have a CSV file that varies in size (currently 25k rows), with each line being a pair of names. I want to count how many times each pair of names appears throughout the file, regardless of order: "James,Sarah" is equal to "Sarah,James" for counting purposes.

Can I just convert the string to some sort of numerical value, then count how many times each numerical value appears in the file?
Like this?
code:
import pandas as pd
df = pd.read_csv(csv_name, names=["0", "1"])
counts = df["0"].value_counts()
for key, value in df["1"].value_counts().items():
    counts[key] = value + counts.get(key, 0)
I didn't try it, if it doesn't work, I can think about it again.

Cingulate fucked around with this message at 01:55 on Jan 5, 2017

Jose Cuervo
Aug 25, 2004

Cingulate posted:

Like this?
code:
import pandas as pd
df = pd.read_csv(csv_name, names=["0", "1"])
counts = df["0"].value_counts()
for key, value in df["1"].value_counts().items():
    counts[key] = value + counts.get(key, 0)
I didn't try it, if it doesn't work, I can think about it again.

I don't think that works - what if you have (James, Sarah) and (Julie, James)? If I understand your code correctly, you would have counts['James'] as 2, and counts['Julie'] and counts['Sarah'] as 1, right? Whereas I believe HughMorris wants (James, Sarah) to have a count of 1, and (Julie, James) to have a count of 1.

I think this will work (although it may be slow?).

code:
import pandas as pd
df = pd.read_csv(cvs_name, names=['first_name', 'second_name'])

pairs = set()
pair_count = {}
for idx, row in df.iterrows():
    pair = (row['first_name'], row['second_name'])
    if pair in pairs:
        # increment whichever ordering was recorded first
        key = pair if pair in pair_count else (pair[1], pair[0])
        pair_count[key] += 1
    else:
        pairs.add(pair)
        pairs.add((pair[1], pair[0]))
        pair_count[pair] = 1

Jose Cuervo fucked around with this message at 02:18 on Jan 5, 2017

Cingulate
Oct 23, 2012

by Fluffdaddy

Jose Cuervo posted:

I don't think that works - what if you have (James, Sarah) and (Julie, James)? If I understand your code correctly, you would have counts['James'] as 2, and counts['Julie'] and counts['Sarah'] as 1, right? Whereas I believe HughMorris wants (James, Sarah) to have a count of 1, and (Julie, James) to have a count of 1.
Ah I get it.

code:
items = "james,sarah;sarah,james;lisa,sarah".split(";")

counts = dict()
for line in items:
    line = ",".join(sorted(line.split(",")))
    counts[line] = 1 + counts.get(line, 0)
counts
Gives {'james,sarah': 2, 'lisa,sarah': 1}

You can do it more elegantly (... particularly with the dict ordering in python 3.6 ...), but maybe it's a start.

As a one-liner:
code:
from collections import Counter; Counter([",".join(sorted(item.split(","))) for item in items])
E: another, probably better

code:
from collections import Counter; Counter([tuple(sorted(item.split(","))) for item in items])

Cingulate fucked around with this message at 02:23 on Jan 5, 2017

Eela6
May 25, 2007
Shredded Hen
This should do it. The trick is to sort your name-tuples. Here I use a simple comparison rather than sorted() because that way we don't need to go tuple -> list -> tuple - we can just go tuple -> tuple.

The answer to whether you can convert to some numerical value and count how many times that occurs is 'yes', that's how the __hash__() magic method works under the hood when you index by tuples in a dictionary.

Cingulate's solution is just fine, but I prefer using the CSV module to read CSVS rather than splitting and joining on my own.

I think this is the fastest and best solution so far, but they're all basically fine.
Python code:
import csv
from collections import Counter

def _generate_sorted_names(filename):
    with open(filename) as infile:
        reader = csv.reader(infile)
        for first, second in reader:
            if first <= second:
                yield first, second
            else:
                yield second, first

def count_name_pairs(filename):
    return Counter(_generate_sorted_names(filename))
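As an aside not raised in the thread: a frozenset also works as an order-insensitive dict key, though it collapses a self-pair like ("James", "James") down to one element, so the sorted-tuple approach above is safer if a name can be paired with itself. A minimal sketch:

```python
from collections import Counter

rows = [("James", "Sarah"), ("Sarah", "James"), ("Lisa", "Sarah")]

# frozenset({"James", "Sarah"}) == frozenset({"Sarah", "James"}),
# so both orderings hash to the same Counter key
counts = Counter(frozenset(pair) for pair in rows)
print(counts[frozenset({"James", "Sarah"})])  # 2
```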

Eela6 fucked around with this message at 03:23 on Jan 5, 2017

Hughmoris
Apr 21, 2007
Let's go to the abyss!

Cingulate posted:

As a one-liner:
code:
from collections import Counter; Counter([tuple(sorted(item.split(","))) for item in items])

Eela6 posted:

Python code:
import csv
from collections import Counter

def _generate_sorted_names(filename):
    with open(filename) as infile:
        reader = csv.reader(infile)
        for first, second in reader:
            if first <= second:
                yield first, second
            else:
                yield second, first

def count_name_pairs(filename):
    return Counter(_generate_sorted_names(filename))

I'll give these a go. Thanks!

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
CPython is really, really bad and even with the new "asyncio" stuff even something like node.js can run circles around it. Python's threading model is the same as Ruby's, which is to say: non-existent. Any object can theoretically touch any object.

I like GrumPy's approach because all those restrictions don't matter for good real-world code anyway. PyPy went batshit crazy trying to support all the Python use-cases whereas GrumPy just cares about "Python, the good parts".

Perhaps the next thing GrumPy should do is add a better threading model to Python, and maybe some of the newer features like async support, and then we'll get a cool, modern Python 2 runtime.

more like dICK
Feb 15, 2010

This is inevitable.
Python the good parts apparently doesn't include the standard library

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



QuarkJets posted:

Wasn't Go also created at Google?

Yeah

Suspicious Dish posted:

CPython is really, really bad and even with the new "asyncio" stuff even something like node.js can run circles around it.

I said this in a different thread, I think, but Mozilla says you can get a pretty good performance boost by compiling Python to Web Assembly or asmjs (or whatever it's called this year) and running the result on SpiderMonkey. It's probably not a terribly fair comparison, though, because JS interpreters don't have to care about threading or C interop. Still kinda neat, though.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

more like dICK posted:

Python the good parts apparently doesn't include the standard library

Yes, that's correct.

BigRedDot
Mar 6, 2008

Another day, another Bokeh: https://bokeh.github.io/blog/2017/1/6/release-0-12-4/

Cingulate
Oct 23, 2012

by Fluffdaddy
code:
d = np.random.random(10000)
out = d[d < .5]
Can I somehow get this - selecting values from d based on applying a condition to d - without actually explicitly assigning d?

Eela6
May 25, 2007
Shredded Hen

Cingulate posted:

code:
d = np.random.random(10000)
out = d[d < .5]
Can I somehow get this - selecting values from d based on applying a condition to d - without actually explicitly assigning d?

Not with numpy's logical indexing, I don't think. If you're just doing a simple list or generator comprehension you could, eg:

(x for x in function_that_generates_d() if condition(x))

But if you're doing logical indexing I believe you need to assign at least once. Is there a specific reason you need not to assign d, or are you just curious?
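One workaround (a sketch, not a numpy builtin - the helper name is made up): wrap the assign-then-index pattern in a tiny function, so the temporary name only exists inside it and the caller never binds the intermediate array.

```python
import numpy as np

def select(arr, predicate):
    """Return the elements of arr where predicate(arr) is True.

    arr is only named inside this function, so the caller never has
    to bind the intermediate array to a variable of its own.
    """
    return arr[predicate(arr)]

# one expression at the call site, no named temporary
out = select(np.array([0.1, 0.6, 0.3, 0.9]), lambda a: a < .5)
print(out)  # [0.1 0.3]
```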

Cingulate
Oct 23, 2012

by Fluffdaddy

Eela6 posted:

Not with numpy's logical indexing, I don't think. If you're just doing a simple list or generator comprehension you could, eg:

(x for x in function_that_generates_d() if condition(x))

But if you're doing logical indexing I believe you need to assign at least once. Is there a specific reason you need not to assign d, or are you just curious?
Yeah I had originally written it as a list comp, but I wanted it all vectorized ...

There is no burning need - I'm just curious.

I basically just wanted to do something like
code:
np.get_where(scipy.stats.ttest_1samp(some_data)[1], cond="<.05")
I.e., generate a 1000-element vector and only retrieve the values smaller than .05, but without ever assigning the vector.

QuarkJets
Sep 8, 2008

Is this just to have it be done in a one-liner?

Cingulate
Oct 23, 2012

by Fluffdaddy

QuarkJets posted:

Is this just to have it be done in a one-liner?
I'm mainly trying to avoid having an extra variable.

more like dICK
Feb 15, 2010

This is inevitable.
Variables are cheap.

Cingulate
Oct 23, 2012

by Fluffdaddy

more like dICK posted:

Variables are cheap.
Not necessarily cognitively.

Also, in cases where the array is large and the rest of the function is long, I have to explicitly delete the reference to free up the memory, or do something like
code:
x = x[x > 1]

QuarkJets
Sep 8, 2008

Cingulate posted:

Not necessarily cognitively.


I don't understand. Do you mean that the code becomes harder to read by having to assign the array to a variable before indexing with it? I think a complicated one-liner is actually the more difficult option

If you are indexing the array with itself, that creates a new array, and you can assign that to the same variable name; no need to delete anything, as nothing refers to the old array then

QuarkJets fucked around with this message at 22:11 on Jan 10, 2017
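The reassignment pattern being discussed, as a minimal sketch (toy data, not from the thread):

```python
import numpy as np

d = np.array([0.1, 0.7, 0.4, 0.9, 0.2])
d = d[d < .5]   # boolean indexing builds a new array; rebinding d
                # drops the last reference to the old one, so CPython's
                # refcounting frees it without an explicit del
print(d)  # [0.1 0.4 0.2]
```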

Cingulate
Oct 23, 2012

by Fluffdaddy

QuarkJets posted:

I don't understand. Do you mean that the code becomes harder to read by having to assign the array to a variable before indexing with it? I think a complicated one-liner is actually the more difficult option
A complicated one, yes. But if there was a simple option ...

QuarkJets posted:

If you are indexing the array with itself, that creates a new array, and you can assign that to the same variable name
... that's just what I did in the post you're replying to, right?

QuarkJets
Sep 8, 2008

Right. And for the earlier post, instead of assigning the result to "out" you assign it to "d"

(and ideally you don't use a single character variable name like "d" but that's just me)

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

I mostly use conda nowadays just so I can easily install different python versions for different projects. I can't remember the last time I used it to install an actual package that was hard to install on my OS.

I've been thinking about switching to pyenv + virtualenv/venv.

Does anyone have any thoughts about conda vs pyenv + virtualenv/venv they'd like to share?

Emmideer
Oct 20, 2011

Lovely night, no?
Grimey Drawer
Does the python len() function actually count every item in the list when it's called, or does it use some shortcut?

Asymmetrikon
Oct 30, 2009

I believe you're a big dork!
According to this post, the length of a list is stored in the list object and it looks like len() just returns that value (in CPython, at least.)

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

jon joe posted:

Does the python len() function actually count every item in the list when it's called, or does it use some shortcut?

Objects representing a finite collection are generally supposed to adhere to a protocol whereby they define a specially-named method, __len__(), which is called by len(). That way, it is up to a given collection class to implement its own response to len(). A given collection class might cache the number of elements so that it can respond quickly when asked how big it is, or it might just laboriously count its items every time; it's up to the implementer.
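A minimal illustration of that protocol (hypothetical class, not from the thread): the class decides for itself how len() is answered, here by caching the count once at construction.

```python
class Ruler:
    """A collection-like object that answers len() from a cached count."""

    def __init__(self, items):
        self._items = list(items)
        self._count = len(self._items)  # counted once, up front

    def __len__(self):
        # len(ruler) dispatches here; no per-call counting happens
        return self._count

r = Ruler("abcde")
print(len(r))  # 5
```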

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

jon joe posted:

Does the python len() function actually count every item in the list when it's called, or does it use some shortcut?

How would it know when to stop counting if it doesn't know how many items are in the list?

Asymmetrikon
Oct 30, 2009

I believe you're a big dork!

Suspicious Dish posted:

How would it know when to stop counting if it doesn't know how many items are the list?

I don't know how much of Python's standard types is defined by spec or left up to implementation; wouldn't it be technically possible for list to be backed by a linked list?

Eela6
May 25, 2007
Shredded Hen

Asymmetrikon posted:

I don't know how much of Python's standard types is defined by spec or left up to implementation; wouldn't it be technically possible for list to be backed by a linked list?

Python has no formal spec; it's defined by implementation. I made a cursory search of the Python Language reference and I don't see anything stopping you except good taste.

Jose Cuervo
Aug 25, 2004
I am getting into web scraping and trying out BeautifulSoup for the first time. Here is my code for getting the name and address of a particular gas station, along with the prices of all kinds of gas it has information for.

Python code:
from bs4 import BeautifulSoup
import urllib2

response = urllib2.urlopen('https://www.gasbuddy.com/Station/47568')
soup = BeautifulSoup(response.read(), 'html.parser')

station_div = soup.find('div', class_='stationDetails')
station_name = station_div.h2.string
station_phone = station_div.find('div', class_='station-phone').string
station_street_address = station_div.find('div', class_='station-address').string
station_area_div = station_div.find('div', class_='station-area')
station_area = station_area_div.find('span', itemprop='addressLocality').string + ', ' + \
				station_area_div.find('span', itemprop='addressRegion').string + ' ' + \
				station_area_div.find('span', itemprop='postalCode').string
print station_name
print station_phone
print station_street_address
print station_area

for fuel_type_label_div in soup.find('div', id='prices').find_all('h4'):
	fuel_type_div = fuel_type_label_div.parent
	fuel_type = fuel_type_label_div.string
	try:
		price = fuel_type_div.find('div', class_='price-display credit-price').string
		last_updated = fuel_type_div.find('div', class_='price-time').string
	except AttributeError:
		price = '--'
		last_updated = '--'
	
	print fuel_type, price, last_updated
Is this the right way to do this kind of thing, or am I being super inefficient in the way I extract the data from the webpage?

StormyDragon
Jun 3, 2011

Jose Cuervo posted:

I am getting into web scraping and trying out BeautifulSoup for the first time. Here is my code for getting the name and address of a particular gas station, along with the prices of all kinds of gas it has information for.

Python code:
Is this the right way to do this kind of thing, or am I being super inefficient in the way I extract the data from the webpage?

The select function is a lot more useful as you won't have to work too much with the specific tags unless the HTML is particularly atrocious.

Python code:
from __future__ import print_function
from collections import namedtuple

from bs4 import BeautifulSoup
import requests

Address = namedtuple("Address", ["address", "locality", "region", "postal_code"])

url = "https://www.gasbuddy.com/Station/47568"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

address = Address(
    address=soup.select("[itemprop=streetAddress]")[0].text,
    locality=soup.select("[itemprop=addressLocality]")[0].text,
    region=soup.select("[itemprop=addressRegion]")[0].text,
    postal_code=soup.select("[itemprop=postalCode]")[0].text,
)

type_price = dict((title.text, (price.text, update.text)) for title, price, update in zip(
    soup.select("#prices .section-title"),
    soup.select("#prices .credit-box .price-display"),
    soup.select("#prices .credit-box .price-time")
) if price.text != '--')

print(address)
print(type_price)

Jose Cuervo
Aug 25, 2004

StormyDragon posted:

The select function is a lot more useful as you won't have to work too much with the specific tags unless the HTML is particularly atrocious.

Python code:
from __future__ import print_function
from collections import namedtuple

from bs4 import BeautifulSoup
import requests

Address = namedtuple("Address", ["address", "locality", "region", "postal_code"])

url = "https://www.gasbuddy.com/Station/47568"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

address = Address(
    address=soup.select("[itemprop=streetAddress]")[0].text,
    locality=soup.select("[itemprop=addressLocality]")[0].text,
    region=soup.select("[itemprop=addressRegion]")[0].text,
    postal_code=soup.select("[itemprop=postalCode]")[0].text,
)

type_price = dict((title.text, (price.text, update.text)) for title, price, update in zip(
    soup.select("#prices .section-title"),
    soup.select("#prices .credit-box .price-display"),
    soup.select("#prices .credit-box .price-time")
) if price.text != '--')

print(address)
print(type_price)

Wow, that looks much nicer than what I was doing.

Questions:
1. I was reading through the BeautifulSoup documentation for using .select and did not see where something like .select("[itemprop=streetAddress]") was defined. How did you know to do that?
EDIT: Ah, this must be 'Find tags by attribute value'.
1.b. And how do you know when to use # vs . ? For example in soup.select("#prices .credit-box .price-display")?
1.c. And why does soup.select("#prices .fuel-type .section-title") return an empty list and not the same list as soup.select("#prices .section-title")?
EDIT: Is this because .fuel-type and .section-title are both class values of the same div? soup.select("#prices .credit-box .price-display") works because it first locates the 'prices' div, then the 'credit-box' div inside the 'prices' div, and finally the 'price-display' div that is inside the 'credit-box' div, right?
2. When creating the type_price dictionary, does BeautifulSoup guarantee that you will have things returned in the order found on the page? That is, will the order of items in the list returned by soup.select("#prices .credit-box .price-display") necessarily be the same as the order of items in the list returned by soup.select("#prices .credit-box .price-time")?

Jose Cuervo fucked around with this message at 22:48 on Jan 11, 2017

baka kaba
Jul 19, 2003

PLEASE ASK ME, THE SELF-PROFESSED NO #1 PAUL CATTERMOLE FAN IN THE SOMETHING AWFUL S-CLUB 7 MEGATHREAD, TO NAME A SINGLE SONG BY HIS EXCELLENT NU-METAL SIDE PROJECT, SKUA, AND IF I CAN'T PLEASE TELL ME TO
EAT SHIT

They're in there; that documentation is kinda dense and it's hard to get a clear overview of what's available, I think

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Jose Cuervo posted:

1.b. And how do you know when to use # vs . ? For example in soup.select("#prices .credit-box .price-display")?

Read about CSS selectors.
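In short: `#` matches an id attribute, `.` matches a class, and `[attr=value]` matches any attribute, and you can chain them to match descendants. A self-contained sketch (hypothetical HTML, not the actual GasBuddy markup):

```python
from bs4 import BeautifulSoup

html = """
<div id="prices">
  <div class="credit-box">
    <div class="price-display">2.19</div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select("#prices")[0]["id"])                            # id selector
print(soup.select(".price-display")[0].text)                      # class selector
print(soup.select("#prices .credit-box .price-display")[0].text)  # descendant chain
```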

Nippashish
Nov 2, 2005

Let me see you dance!
I usually end up with a pattern that looks like this:

Python code:
class LazyLoadingPage(object):
  def __init__(self, url):
    self.url = url
    self._req = None
    self._soup = None

  @property
  def req(self):
    if self._req is None:
      self._req = requests.get(self.url)
    return self._req

  @property
  def soup(self):
    if self._soup is None:
      self._soup = BeautifulSoup(self.req.text, "html.parser")
    return self._soup

  def __str__(self):
    return "{cls}(url: {url})".format(
      cls=self.__class__.__name__,
      url=self.url)

  def __repr__(self):
    return str(self)

class Station(LazyLoadingPage):
  @property
  def address(self):
    return Address(
        address=self.soup.select("[itemprop=streetAddress]")[0].text,
        locality=self.soup.select("[itemprop=addressLocality]")[0].text,
        region=self.soup.select("[itemprop=addressRegion]")[0].text,
        postal_code=self.soup.select("[itemprop=postalCode]")[0].text,
    )

  @property
  def types(self):
    return list(title.text for title in self.soup.select("#prices .section-title"))

  @property
  def type_price(self):
    return dict((title.text, (price.text, update.text)) for title, price, update in zip(
        self.soup.select("#prices .section-title"),
        self.soup.select("#prices .credit-box .price-display"),
        self.soup.select("#prices .credit-box .price-time")
    ) if price.text != '--')
In this case it ends up just being a wordier version of what StormyDragon already posted, but when you want to pull a bunch of information off a complicated site the extra layer of objects can be nice.


StormyDragon
Jun 3, 2011

Jose Cuervo posted:

2. When creating the type_price dictionary, does BeautifulSoup guarantee that you will have things returned in the order found on the page? That is, will the order of items in the list returned by soup.select("#prices .credit-box .price-display") necessarily be the same as the order of items in the list returned by soup.select("#prices .credit-box .price-time")?

You will always get items in the order they were found in the HTML hierarchy that the selector is navigating, though you have to be careful that the page doesn't omit values; otherwise you end up zipping together items that don't match.
I did try using the class selector .credit-price, but it turns out that class is not attached to price displays without a price; fortunately there was another CSS class on there.
