  • Locked thread
QuarkJets
Sep 8, 2008

Thermopyle posted:

Of course, Google also has a real hard-on for Go, so maybe it's just the new and shiny factor?

Wasn't Go also created at Google?


Cingulate
Oct 23, 2012

by Fluffdaddy

Thermopyle posted:

I always use pip to install stuff in my conda environments
Why do you do that?
I use conda wherever I can (which is almost always), particularly cause 1. I can do conda upgrade --all, 2. it'll take care of dependencies while respecting my MKL numpy.

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Cingulate posted:

Why do you do that?
I use conda wherever I can (which is almost always), particularly cause 1. I can do conda upgrade --all, 2. it'll take care of dependencies while respecting my MKL numpy.

Because conda is usually (at least it used to be, I haven't checked in a while) behind on versions of packages I use. Lots of packages aren't on conda, so then I use pip anyway and it is (was?) a pain to maintain packages with both conda and pip. requirements.txt is widely used, so I can easily use upstream generated requirements.txt or downstream can use my requirements.txt.

And it's not very good practice to upgrade all your packages in one fell swoop; you should pin your packages at specific versions and only upgrade the ones you need to.

Basically, I've found no upside to using conda-provided packages and some downsides.

accipter
Sep 12, 2003

Thermopyle posted:

Because conda is usually (at least it used to be, I haven't checked in a while) behind on versions of packages I use. Lots of packages aren't on conda, so then I use pip anyway and it is (was?) a pain to maintain packages with both conda and pip. requirements.txt is widely used, so I can easily use upstream generated requirements.txt or downstream can use my requirements.txt.

And it's not very good practice to upgrade all your packages in one fell swoop; you should pin your packages at specific versions and only upgrade the ones you need to.

Basically, I've found no upside to using conda-provided packages and some downsides.

Do you work much with numpy/scipy? I found that these basically require using conda for package management on Windows.

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

accipter posted:

Do you work much with numpy/scipy? I found that these basically require using conda for package management on Windows.

Not at all, but I do all my work in Ubuntu virtual machines on my Windows host machine, so I guess I probably wouldn't have a problem anyway.

If for some reason I had to work on a Windows machine with numpy, I'd probably use conda to install numpy and still use pip for everything else.

Hughmoris
Apr 21, 2007
Let's go to the abyss!
I can't think of a simple way to accomplish the following:

I have a CSV file that varies in size (currently 25k rows), with each line being a pair of names. I want to count how many times each pair of names appears throughout the file, regardless of order: "James,Sarah" is equal to "Sarah,James" for counting purposes.

Can I just convert the string to some sort of numerical value, then count how many times each numerical value appears in the file?

Cingulate
Oct 23, 2012

by Fluffdaddy

Hughmoris posted:

I can't think of a simple way to accomplish the following:

I have a CSV file that varies in size (currently 25k rows), with each line being a pair of names. I want to count how many times each pair of names appears throughout the file, regardless of order: "James,Sarah" is equal to "Sarah,James" for counting purposes.

Can I just convert the string to some sort of numerical value, then count how many times each numerical value appears in the file?
Like this?
code:
import pandas as pd
df = pd.read_csv(csv_name, names=["0", "1"])
counts = df["0"].value_counts()
for key, value in df["1"].value_counts().items():
    counts[key] = value + counts.get(key, 0)
I didn't try it, if it doesn't work, I can think about it again.

Cingulate fucked around with this message at 01:55 on Jan 5, 2017

Jose Cuervo
Aug 25, 2004

Cingulate posted:

Like this?
code:
import pandas as pd
df = pd.read_csv(csv_name, names=["0", "1"])
counts = df["0"].value_counts()
for key, value in df["1"].value_counts().items():
    counts[key] = value + counts.get(key, 0)
I didn't try it, if it doesn't work, I can think about it again.

I don't think that works - what if you have (James, Sarah) and (Julie, James)? If I understand your code correctly, you would have counts['James'] as 2, and counts['Julie'] and counts['Sarah'] as 1, right? Whereas I believe HughMorris wants (James, Sarah) to have a count of 1, and (Julie, James) to have a count of 1.

I think this will work (although it may be slow?).

code:
import pandas as pd
df = pd.read_csv(cvs_name, names=['first_name', 'second_name'])

pairs = set()
pair_count = {}
for idx, row in df.iterrows():
    pair = (row['first_name'], row['second_name'])
    if pair in pairs:
        # increment whichever ordering was recorded first
        key = pair if pair in pair_count else (pair[1], pair[0])
        pair_count[key] += 1
    else:
        pairs.add(pair)
        pairs.add((pair[1], pair[0]))
        pair_count[pair] = 1

Jose Cuervo fucked around with this message at 02:18 on Jan 5, 2017

Cingulate
Oct 23, 2012

by Fluffdaddy

Jose Cuervo posted:

I don't think that works - what if you have (James, Sarah) and (Julie, James)? If I understand your code correctly, you would have counts['James'] as 2, and counts['Julie'] and counts['Sarah'] as 1, right? Whereas I believe HughMorris wants (James, Sarah) to have a count of 1, and (Julie, James) to have a count of 1.
Ah I get it.

code:
items = "james,sarah;sarah,james;lisa,sarah".split(";")

counts = dict()
for line in items:
    line = ",".join(sorted(line.split(",")))
    counts[line] = 1 + counts.get(line, 0)
counts
Gives {'james,sarah': 2, 'lisa,sarah': 1}

You can do it more elegantly (... particularly with the dict ordering in python 3.6 ...), but maybe it's a start.

As a one-liner:
code:
from collections import Counter; Counter([",".join(sorted(item.split(","))) for item in items])
E: another, probably better

code:
from collections import Counter; Counter([tuple(sorted(item.split(","))) for item in items])

Cingulate fucked around with this message at 02:23 on Jan 5, 2017

Eela6
May 25, 2007
Shredded Hen
This should do it. The trick is to sort your name-tuples. Here I use a simple comparison rather than sorted() because that way we don't need to go tuple -> list -> tuple - we can just go tuple -> tuple.

The answer to whether you can convert to some numerical value and count how many times that occurs is 'yes', that's how the __hash__() magic method works under the hood when you index by tuples in a dictionary.

Cingulate's solution is just fine, but I prefer using the CSV module to read CSVS rather than splitting and joining on my own.

I think this is the fastest and best solution so far, but they're all basically fine.
Python code:
import csv
from collections import Counter

def _generate_sorted_names(filename):
    with open(filename) as infile:
        reader = csv.reader(infile)
        for first, second in reader:
            if first <= second:
                yield first, second
            else:
                yield second, first

def count_name_pairs(filename):
    return Counter(_generate_sorted_names(filename))
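As an aside not raised in the thread: a frozenset also works as an order-insensitive dict key, though it collapses a self-pair like ("James", "James") down to one element, so the sorted-tuple approach above is safer if a name can be paired with itself. A minimal sketch:

```python
from collections import Counter

rows = [("James", "Sarah"), ("Sarah", "James"), ("Lisa", "Sarah")]

# frozenset({"James", "Sarah"}) == frozenset({"Sarah", "James"}),
# so both orderings hash to the same Counter key
counts = Counter(frozenset(pair) for pair in rows)
print(counts[frozenset({"James", "Sarah"})])  # 2
```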

Eela6 fucked around with this message at 03:23 on Jan 5, 2017

Hughmoris
Apr 21, 2007
Let's go to the abyss!

Cingulate posted:

As a one-liner:
code:
from collections import Counter; Counter([tuple(sorted(item.split(","))) for item in items])

Eela6 posted:

Python code:
import csv
from collections import Counter

def _generate_sorted_names(filename):
    with open(filename) as infile:
        reader = csv.reader(infile)
        for first, second in reader:
            if first <= second:
                yield first, second
            else:
                yield second, first

def count_name_pairs(filename):
    return Counter(_generate_sorted_names(filename))

I'll give these a go. Thanks!

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
CPython is really, really bad and even with the new "asyncio" stuff even something like node.js can run circles around it. Python's threading model is the same as Ruby's, which is to say: non-existent. Any object can theoretically touch any object.

I like GrumPy's approach because all those restrictions don't matter for good real-world code anyway. PyPy went batshit crazy trying to support all the Python use-cases whereas GrumPy just cares about "Python, the good parts".

Perhaps the next thing GrumPy should do is add a better threading model to Python, and maybe some of the newer features like async support, and then we'll get a cool, modern Python 2 runtime.

more like dICK
Feb 15, 2010

This is inevitable.
Python the good parts apparently doesn't include the standard library

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



QuarkJets posted:

Wasn't Go also created at Google?

Yeah

Suspicious Dish posted:

CPython is really, really bad and even with the new "asyncio" stuff even something like node.js can run circles around it.

I said this in a different thread, I think, but Mozilla says you can get a pretty good performance boost by compiling Python to Web Assembly or asmjs (or whatever it's called this year) and running the result on SpiderMonkey. It's probably not a terribly fair comparison, though, because JS interpreters don't have to care about threading or C interop. Still kinda neat, though.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

more like dICK posted:

Python the good parts apparently doesn't include the standard library

Yes, that's correct.

BigRedDot
Mar 6, 2008

Another day, another Bokeh: https://bokeh.github.io/blog/2017/1/6/release-0-12-4/

Cingulate
Oct 23, 2012

by Fluffdaddy
code:
d = np.random.random(10000)
out = d[d < .5]
Can I somehow get this - selecting values from d based on applying a condition to d - without actually explicitly assigning d?

Eela6
May 25, 2007
Shredded Hen

Cingulate posted:

code:
d = np.random.random(10000)
out = d[d < .5]
Can I somehow get this - selecting values from d based on applying a condition to d - without actually explicitly assigning d?

Not with numpy's logical indexing, I don't think. If you're just doing a simple list or generator comprehension you could, eg:

(x for x in function_that_generates_d() if condition(x))

But if you're doing logical indexing I believe you need to assign at least once. Is there a specific reason you need not to assign d, or are you just curious?
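One workaround (a sketch, not a numpy builtin - the helper name is made up): wrap the assign-then-index pattern in a tiny function, so the temporary name only exists inside it and the caller never binds the intermediate array.

```python
import numpy as np

def select(arr, predicate):
    """Return the elements of arr where predicate(arr) is True.

    arr is only named inside this function, so the caller never has
    to bind the intermediate array to a variable of its own.
    """
    return arr[predicate(arr)]

# one expression at the call site, no named temporary
out = select(np.array([0.1, 0.6, 0.3, 0.9]), lambda a: a < .5)
print(out)  # [0.1 0.3]
```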

Cingulate
Oct 23, 2012

by Fluffdaddy

Eela6 posted:

Not with numpy's logical indexing, I don't think. If you're just doing a simple list or generator comprehension you could, eg:

(x for x in function_that_generates_d() if condition(x))

But if you're doing logical indexing I believe you need to assign at least once. Is there a specific reason you need not to assign d, or are you just curious?
Yeah I had originally written it as a list comp, but I wanted it all vectorized ...

There is no burning need - I'm just curious.

I basically just wanted to do something like
code:
np.get_where(scipy.stats.ttest_1samp(some_data)[1], cond="<.05")
I.e., generate a 1000-element vector and only retrieve the values smaller than .05, but without ever assigning the vector.

QuarkJets
Sep 8, 2008

Is this just to have it be done in a one-liner?

Cingulate
Oct 23, 2012

by Fluffdaddy

QuarkJets posted:

Is this just to have it be done in a one-liner?
I'm mainly trying to avoid having an extra variable.

more like dICK
Feb 15, 2010

This is inevitable.
Variables are cheap.

Cingulate
Oct 23, 2012

by Fluffdaddy

more like dICK posted:

Variables are cheap.
Not necessarily cognitively.

Also, in cases where the array is large and the rest of the function is long, I have to explicitly delete the reference to free up the memory, or do something like
code:
x = x[x > 1]

QuarkJets
Sep 8, 2008

Cingulate posted:

Not necessarily cognitively.


I don't understand. Do you mean that the code becomes harder to read by having to assign the array to a variable before indexing with it? I think a complicated one-liner is actually the more difficult option

If you are indexing the array with itself, that creates a new array, and you can assign that to the same variable name; no need to delete anything, as nothing refers to the old array then

QuarkJets fucked around with this message at 22:11 on Jan 10, 2017
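The reassignment pattern being discussed, as a minimal sketch (toy data, not from the thread):

```python
import numpy as np

d = np.array([0.1, 0.7, 0.4, 0.9, 0.2])
d = d[d < .5]   # boolean indexing builds a new array; rebinding d
                # drops the last reference to the old one, so CPython's
                # refcounting frees it without an explicit del
print(d)  # [0.1 0.4 0.2]
```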

Cingulate
Oct 23, 2012

by Fluffdaddy

QuarkJets posted:

I don't understand. Do you mean that the code becomes harder to read by having to assign the array to a variable before indexing with it? I think a complicated one-liner is actually the more difficult option
A complicated one, yes. But if there was a simple option ...

QuarkJets posted:

If you are indexing the array with itself, that creates a new array, and you can assign that to the same variable name
... that's just what I did in the post you're replying to, right?

QuarkJets
Sep 8, 2008

Right. And for the earlier post, instead of assigning the result to "out" you assign it to "d"

(and ideally you don't use a single character variable name like "d" but that's just me)

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

I mostly use conda nowadays just so I can easily install different python versions for different projects. I can't remember the last time I used it to install an actual package that was hard to install on my OS.

I've been thinking about switching to pyenv + virtualenv/venv.

Does anyone have any thoughts about conda vs pyenv + virtualenv/venv they'd like to share?

Emmideer
Oct 20, 2011

Lovely night, no?
Grimey Drawer
Does the python len() function actually count every item in the list when it's called, or does it use some shortcut?

Asymmetrikon
Oct 30, 2009

I believe you're a big dork!
According to this post, the length of a list is stored in the list object and it looks like len() just returns that value (in CPython, at least.)

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

jon joe posted:

Does the python len() function actually count every item in the list when it's called, or does it use some shortcut?

Objects representing a finite collection are generally supposed to adhere to a protocol whereby they define a specially-named method, __len__(), which is called by len(). That way, it is up to a given collection class to implement its own response to len(). A given collection class might cache the number of elements so that it can respond quickly when asked how big it is, or it might just laboriously count its items every time; it's up to the implementer.
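A minimal illustration of that protocol (hypothetical class, not from the thread): the class decides for itself how len() is answered, here by caching the count once at construction.

```python
class Ruler:
    """A collection-like object that answers len() from a cached count."""

    def __init__(self, items):
        self._items = list(items)
        self._count = len(self._items)  # counted once, up front

    def __len__(self):
        # len(ruler) dispatches here; no per-call counting happens
        return self._count

r = Ruler("abcde")
print(len(r))  # 5
```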

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

jon joe posted:

Does the python len() function actually count every item in the list when it's called, or does it use some shortcut?

How would it know when to stop counting if it doesn't know how many items are in the list?

Asymmetrikon
Oct 30, 2009

I believe you're a big dork!

Suspicious Dish posted:

How would it know when to stop counting if it doesn't know how many items are the list?

I don't know how much of Python's standard types is defined by spec or left up to implementation; wouldn't it be technically possible for list to be backed by a linked list?

Eela6
May 25, 2007
Shredded Hen

Asymmetrikon posted:

I don't know how much of Python's standard types is defined by spec or left up to implementation; wouldn't it be technically possible for list to be backed by a linked list?

Python has no formal spec; it's defined by implementation. I made a cursory search of the Python Language reference and I don't see anything stopping you except good taste.

Jose Cuervo
Aug 25, 2004
I am getting into web scraping and trying out BeautifulSoup for the first time. Here is my code for getting the name and address of a particular gas station, along with the prices of all kinds of gas it has information for.

Python code:
from bs4 import BeautifulSoup
import urllib2

response = urllib2.urlopen('https://www.gasbuddy.com/Station/47568')
soup = BeautifulSoup(response.read(), 'html.parser')

station_div = soup.find('div', class_='stationDetails')
station_name = station_div.h2.string
station_phone = station_div.find('div', class_='station-phone').string
station_street_address = station_div.find('div', class_='station-address').string
station_area_div = station_div.find('div', class_='station-area')
station_area = station_area_div.find('span', itemprop='addressLocality').string + ', ' + \
				station_area_div.find('span', itemprop='addressRegion').string + ' ' + \
				station_area_div.find('span', itemprop='postalCode').string
print station_name
print station_phone
print station_street_address
print station_area

for fuel_type_label_div in soup.find('div', id='prices').find_all('h4'):
	fuel_type_div = fuel_type_label_div.parent
	fuel_type = fuel_type_label_div.string
	try:
		price = fuel_type_div.find('div', class_='price-display credit-price').string
		last_updated = fuel_type_div.find('div', class_='price-time').string
	except AttributeError:
		price = '--'
		last_updated = '--'
	
	print fuel_type, price, last_updated
Is this the right way to do this kind of thing, or am I being super inefficient in the way I extract the data from the webpage?

StormyDragon
Jun 3, 2011

Jose Cuervo posted:

I am getting into web scraping and trying out BeautifulSoup for the first time. Here is my code for getting the name and address of a particular gas station, along with the prices of all kinds of gas it has information for.

Python code:
Is this the right way to do this kind of thing, or am I being super inefficient in the way I extract the data from the webpage?

The select function is a lot more useful as you won't have to work too much with the specific tags unless the HTML is particularly atrocious.

Python code:
from __future__ import print_function
from collections import namedtuple

from bs4 import BeautifulSoup
import requests

Address = namedtuple("Address", ["address", "locality", "region", "postal_code"])

url = "https://www.gasbuddy.com/Station/47568"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

address = Address(
    address=soup.select("[itemprop=streetAddress]")[0].text,
    locality=soup.select("[itemprop=addressLocality]")[0].text,
    region=soup.select("[itemprop=addressRegion]")[0].text,
    postal_code=soup.select("[itemprop=postalCode]")[0].text,
)

type_price = dict((title.text, (price.text, update.text)) for title, price, update in zip(
    soup.select("#prices .section-title"),
    soup.select("#prices .credit-box .price-display"),
    soup.select("#prices .credit-box .price-time")
) if price.text != '--')

print(address)
print(type_price)

Jose Cuervo
Aug 25, 2004

StormyDragon posted:

The select function is a lot more useful as you won't have to work too much with the specific tags unless the HTML is particularly atrocious.

Python code:
from __future__ import print_function
from collections import namedtuple

from bs4 import BeautifulSoup
import requests

Address = namedtuple("Address", ["address", "locality", "region", "postal_code"])

url = "https://www.gasbuddy.com/Station/47568"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

address = Address(
    address=soup.select("[itemprop=streetAddress]")[0].text,
    locality=soup.select("[itemprop=addressLocality]")[0].text,
    region=soup.select("[itemprop=addressRegion]")[0].text,
    postal_code=soup.select("[itemprop=postalCode]")[0].text,
)

type_price = dict((title.text, (price.text, update.text)) for title, price, update in zip(
    soup.select("#prices .section-title"),
    soup.select("#prices .credit-box .price-display"),
    soup.select("#prices .credit-box .price-time")
) if price.text != '--')

print(address)
print(type_price)

Wow, that looks much nicer than what I was doing.

Questions:
1. I was reading through the BeautifulSoup documentation for using .select and did not see where something like .select("[itemprop=streetAddress]") was defined. How did you know to do that?
EDIT: Ah, this must be 'Find tags by attribute value'.
1.b. And how do you know when to use # vs . ? For example in soup.select("#prices .credit-box .price-display")?
1.c. And why does soup.select("#prices .fuel-type .section-title") return an empty list and not the same list as soup.select("#prices .section-title")?
EDIT: Is this because .fuel-type and .section-title are both class values of the same div? soup.select("#prices .credit-box .price-display") works because it first locates the 'prices' div, then the 'credit-box' div inside the 'prices' div, and finally the 'price-display' div that is inside the 'credit-box' div, right?
2. When creating the type_price dictionary, does BeautifulSoup guarantee that you will have things returned in the order found on the page? That is, will the order of items in the list returned by soup.select("#prices .credit-box .price-display") necessarily be the same as the order of items in the list returned by soup.select("#prices .credit-box .price-time")?

Jose Cuervo fucked around with this message at 22:48 on Jan 11, 2017

baka kaba
Jul 19, 2003

PLEASE ASK ME, THE SELF-PROFESSED NO #1 PAUL CATTERMOLE FAN IN THE SOMETHING AWFUL S-CLUB 7 MEGATHREAD, TO NAME A SINGLE SONG BY HIS EXCELLENT NU-METAL SIDE PROJECT, SKUA, AND IF I CAN'T PLEASE TELL ME TO
EAT SHIT

They're in there; that documentation is kinda dense and it's hard to get a clear overview of what's available, I think

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Jose Cuervo posted:

1.b. And how do you know when to use # vs . ? For example in soup.select("#prices .credit-box .price-display")?

Read about CSS selectors.
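In short: `#` matches an id attribute, `.` matches a class, and `[attr=value]` matches any attribute, and you can chain them to match descendants. A self-contained sketch (hypothetical HTML, not the actual GasBuddy markup):

```python
from bs4 import BeautifulSoup

html = """
<div id="prices">
  <div class="credit-box">
    <div class="price-display">2.19</div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select("#prices")[0]["id"])                            # id selector
print(soup.select(".price-display")[0].text)                      # class selector
print(soup.select("#prices .credit-box .price-display")[0].text)  # descendant chain
```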

Nippashish
Nov 2, 2005

Let me see you dance!
I usually end up with a pattern that looks like this:

Python code:
class LazyLoadingPage(object):
  def __init__(self, url):
    self.url = url
    self._req = None
    self._soup = None

  @property
  def req(self):
    if self._req is None:
      self._req = requests.get(self.url)
    return self._req

  @property
  def soup(self):
    if self._soup is None:
      self._soup = BeautifulSoup(self.req.text, "html.parser")
    return self._soup

  def __str__(self):
    return "{cls}(url: {url})".format(
      cls=self.__class__.__name__,
      url=self.url)

  def __repr__(self):
    return str(self)

class Station(LazyLoadingPage):
  @property
  def address(self):
    return Address(
        address=self.soup.select("[itemprop=streetAddress]")[0].text,
        locality=self.soup.select("[itemprop=addressLocality]")[0].text,
        region=self.soup.select("[itemprop=addressRegion]")[0].text,
        postal_code=self.soup.select("[itemprop=postalCode]")[0].text,
    )

  @property
  def types(self):
    return list(title.text for title in self.soup.select("#prices .section-title"))

  @property
  def type_price(self):
    return dict((title.text, (price.text, update.text)) for title, price, update in zip(
        self.soup.select("#prices .section-title"),
        self.soup.select("#prices .credit-box .price-display"),
        self.soup.select("#prices .credit-box .price-time")
    ) if price.text != '--')
In this case it ends up just being a wordier version of what StormyDragon already posted, but when you want to pull a bunch of information off a complicated site the extra layer of objects can be nice.


StormyDragon
Jun 3, 2011

Jose Cuervo posted:

2. When creating the type_price dictionary, does BeautifulSoup guarantee that you will have things returned in the order found on the page? That is, will the order of items in the list returned by soup.select("#prices .credit-box .price-display") necessarily be the same as the order of items in the list returned by soup.select("#prices .credit-box .price-time")?

You will always get items in the order they were found in the HTML hierarchy that the selector is navigating, though you have to be careful that the page doesn't omit values; otherwise you end up zipping together items that don't match.
I did try using the class selector .credit-price, but it turns out that class is not attached to price displays without a price; fortunately there was another CSS class on there.
