Scaevolus
Apr 16, 2007

chemosh6969 posted:

I've never done web scraping before and was wondering if anyone knew of, or had, an example of an IMDb scrape? I want to learn how to do it for a few other sites, but I really wanted to find an example to build off of.
Parsing their official data files would be more reliable: http://www.imdb.com/interfaces

root beer posted:

Take a look at BeautifulSoup. It has extensive documentation and would make the task loads easier.

This guy has a pretty good guide for how to use it.
lxml is faster and better than BeautifulSoup.


TheMaskedUgly
Sep 21, 2008

Let's play a different game.
Say I've got a list of values, in this form:
["reference-2-000111-A", "reference-2-000111-B", "reference-2-000111C"]
So there's three sections to each list, the word reference which is common to all values like this, the number which is common to all the values in this set, and the letter which is difference for each.
How can I separate a value in the list, so I can use the number elsewhere. How can I get the two parts of the number on their own if its in a list, basically.

Ragg
Apr 27, 2003

<The Honorable Badgers>

TheMaskedUgly posted:

Say I've got a list of values, in this form:
["reference-2-000111-A", "reference-2-000111-B", "reference-2-000111C"]
So there's three sections to each list, the word reference which is common to all values like this, the number which is common to all the values in this set, and the letter which is difference for each.
How can I separate a value in the list, so I can use the number elsewhere. How can I get the two parts of the number on their own if its in a list, basically.

Call .split("-") on the string you want separated, that will give you a list that looks like ["reference", "2", "000111", "A"].

deedee megadoodoo
Sep 28, 2000
Two roads diverged in a wood, and I, I took the one to Flavortown, and that has made all the difference.


TheMaskedUgly posted:

Say I've got a list of values, in this form:
["reference-2-000111-A", "reference-2-000111-B", "reference-2-000111C"]
So there's three sections to each list, the word reference which is common to all values like this, the number which is common to all the values in this set, and the letter which is difference for each.
How can I separate a value in the list, so I can use the number elsewhere. How can I get the two parts of the number on their own if its in a list, basically.

If every value starts with the word "reference", then has a string of numbers and a single letter at the end, you can extract the number and letter with this...

code:
>>>data = ["reference-2-000111-A", "reference-2-000111-B", "reference-2-000111-C"]
>>>[(i[10:-2],i[-1]) for i in data]
[('2-000111', 'A'), ('2-000111', 'B'), ('2-000111', 'C')]

m0nk3yz
Mar 13, 2002

Behold the power of cheese!

Scaevolus posted:

lxml is faster and better than BeautifulSoup.

Having just cut over to lxml for a project, I can completely agree with this - lxml is pure awesome.

tef
May 30, 2004

-> some l-system crap ->

m0nk3yz posted:

Having just cut over to lxml for a project, I can completely agree with this - lxml is pure awesome.

I have also heard good things about http://code.google.com/p/httplib2/. lxml has been excellent for scraping so far.

Aside: urllib2 is awful. pycurl does what you want but you'll inevitably write a wrapper for it.

Like this one: http://github.com/tef/durb/blob/83a63a57088db0ca16f5959111462e41af9572d1/fetch.py
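The core of such a wrapper is only a few lines; roughly (an untested sketch):

code:
import pycurl
from StringIO import StringIO

def fetch(url):
    buf = StringIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.FOLLOWLOCATION, 1)   # follow redirects
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.perform()
    c.close()
    return buf.getvalue()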

nbv4
Aug 21, 2002

by Duchess Gummybuns
Anyone have any experience with the csv module? I have a csv file that, after so many lines, switches headers. Like this:

code:
head1,head2,head3
12,172,34
453,568,235
342,834,2234
...
head4,head5,head6,head7
564,3576,3546,322
356,332,556,789
3435,435,76,980
I have a DictReader that is defined with the first header line, but I can't figure out how to set the new header on the same DictReader. I know how to detect when I've reached a new header row during iteration; I just can't figure out how to give the DictReader the new list of headers.

DICTATOR OF FUNK
Nov 6, 2007

aaaaaw yeeeeeah

Scaevolus posted:

lxml is faster and better than BeautifulSoup.
You are my hero.

I've been using BeautifulSoup for a number of projects and this looks very promising. Thanks!

deedee megadoodoo
Sep 28, 2000
Two roads diverged in a wood, and I, I took the one to Flavortown, and that has made all the difference.


nbv4 posted:

Anyone have any experience with the csv module? I have a csv file that, after so many lines, switches headers. Like this:

code:
head1,head2,head3
12,172,34
453,568,235
342,834,2234
...
head4,head5,head6,head7
564,3576,3546,322
356,332,556,789
3435,435,76,980
I have a DictReader that is defined with the first header line, but I can't figure out how to set the new header on the same DictReader. I know how to detect when I've reached a new header row during iteration; I just can't figure out how to give the DictReader the new list of headers.

How about reading through the whole file first and marking the header sections, since you know how to identify them? Then seek(0) and start reading until you reach the next header. At that point, recreate your DictReader and it should pick up the new header. Not sure if this works, but it's worth a shot.

Alternatively, subclass "reader" and alter the next() method to inspect each new line to see if it's a header.
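A third option: skip DictReader entirely and zip the rows up yourself. A rough, untested sketch (is_header() is a stand-in for whatever test you already have):

code:
import csv

def read_segmented(csvfile, is_header):
    # yield each data row as a dict, re-keying whenever a header row appears
    fieldnames = None
    for row in csv.reader(csvfile):
        if not row:
            continue
        if is_header(row):
            fieldnames = row          # switch to the new header
        elif fieldnames:
            yield dict(zip(fieldnames, row))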

tef
May 30, 2004

-> some l-system crap ->

root beer posted:

You are my hero.

I've been using BeautifulSoup for a number of projects and this looks very promising. Thanks!

Learn XPath; it is very good for short scraping code.

A simple example: /html/head/title (guess what it does)

Or even //a/@href - this returns the href attribute of every a node in the document.
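In lxml that's something like this (a quick sketch; the URL is a stand-in):

code:
from lxml import html

doc = html.parse("http://example.com/")
print doc.xpath("/html/head/title/text()")   # the page title
print doc.xpath("//a/@href")                 # every link target in the page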

A couple of warnings about lxml:

The namespace support for xml is clunky, and not very convenient to use.

You can get regular expression support in xpaths too if you add the right namespace to your query.

It will mangle the HTML somewhat to ensure it's well-formed, i.e. it won't be the same as what appears in the browser (this sometimes includes character codes).

When you're downloading pages, you might have to understand character encodings in Python. BeautifulSoup hides this much better (it has a UnicodeDammit class for guessing encodings).

For example: if you decode the raw bytes using the character set from the HTTP header and then hand the resulting unicode string to lxml to parse, it can fail with a ValueError (lxml refuses unicode strings that carry their own encoding declaration). (I think this only happens when the document itself specifies an encoding, probably in the XML declaration, but I can't recall. It might happen more frequently with xml, I dunno.)

And finally, in lxml a path starting with // means descendant-or-self from the document root (/), not from the context node (.).
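e.g. (a tiny demo of the difference):

code:
from lxml import html

page = "<html><body><div id='d'><a href='/in'>in</a></div><a href='/out'>out</a></body></html>"
div = html.fromstring(page).get_element_by_id('d')
print div.xpath("//a/@href")    # ['/in', '/out'] - // searches the whole document
print div.xpath(".//a/@href")   # ['/in'] - .// searches only this subtree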


p.s. also, a surprising number of webservers are broken in weird ways. scraping can be a constant headache. character encodings will drive you insane after a while :v:

tef fucked around with this message at 00:36 on Aug 25, 2009

DICTATOR OF FUNK
Nov 6, 2007

aaaaaw yeeeeeah

tef posted:

awesomeness
I familiarized myself with XPaths using Hpricot and Firebug on RoR a while back and missed them while using BeautifulSoup; lxml looks like it'll cut a large chunk of work out of what I'm currently developing.

All the tips are much appreciated, thanks. :)

m0nk3yz
Mar 13, 2002

Behold the power of cheese!

tef posted:

I have also heard good things about http://code.google.com/p/httplib2/. lxml has been excellent for scraping so far.

Aside: urllib2 is awful. pycurl does what you want but you'll inevitably write a wrapper for it.

Like this one: http://github.com/tef/durb/blob/83a63a57088db0ca16f5959111462e41af9572d1/fetch.py

I *always* use pycurl - and yeah, I *always* write a wrapper for it, but I'm dipping my toes into httplib2 for some work I'm doing.

nonathlon
Jul 9, 2004
And yet, somehow, now it's my fault ...

nbv4 posted:

Anyone have any experience with the csv module? I have a csv file that, after so many lines, switches headers. Like this:

A good solution above, but out of curiosity, are these "segmented" csv files common? I'm wondering how you can tell a row is a new header except from context (i.e. its cells are letters while normal rows are numbers).

nbv4
Aug 21, 2002

by Duchess Gummybuns

outlier posted:

A good solution above, but out of curiosity are these "segmented" csv files common? I'm wondering how you can tell something is a new header except from context (i.e. these are letters and normal rows are numbers).

The csv files are created by another application of mine, and all header rows begin with "##", so I know when a row is a header. The first column is always a date, so if there happens to be a pound sign in the date then there are problems anyway.

duck monster
Dec 15, 2004

quote:

"In the current state of affairs and given the plans of the Python maintainers, there will be likely no python2.6 (even as non-default) in squeeze."

Yay debian!

deedee megadoodoo
Sep 28, 2000
Two roads diverged in a wood, and I, I took the one to Flavortown, and that has made all the difference.


duck monster posted:

Yay debian!

I love it when people quote things without context.

duck monster
Dec 15, 2004

What's there to say? 2.6 won't be in the next Debian.

deedee megadoodoo
Sep 28, 2000
Two roads diverged in a wood, and I, I took the one to Flavortown, and that has made all the difference.


duck monster posted:

Whats there to say? 2.6 won't be in the next debian.

And?

[root@ashprdmon04 nagios]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
[root@ashprdmon04 nagios]# rpm -q python
python-2.4.3-24.el5

Ethereal
Mar 8, 2003
Does anyone have recommendations for a good host with Python as a usable language? I'm planning on setting up a small-ish site and don't need that much bandwidth per month, just a good reliable provider.

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Ethereal posted:

Does anyone have recommendations for a good host with Python as a usable language? I'm planning on setting up a small-ish site and don't need that much bandwidth per month, just a good reliable provider.

I've had plenty of success with Dreamhost.

Xenos
Jun 17, 2005

Ethereal posted:

Does anyone have recommendations for a good host with Python as a usable language? I'm planning on setting up a small-ish site and don't need that much bandwidth per month, just a good reliable provider.

WebFaction is great when it comes to running Python projects. You can deploy pretty much however you want, and the admins are incredibly helpful.

king_kilr
May 25, 2007

Ethereal posted:

Does anyone have recommendations for a good host with Python as a usable language? I'm planning on setting up a small-ish site and don't need that much bandwidth per month, just a good reliable provider.

Without a doubt, WebFaction.

m0nk3yz
Mar 13, 2002

Behold the power of cheese!

Ethereal posted:

Does anyone have recommendations for a good host with Python as a usable language? I'm planning on setting up a small-ish site and don't need that much bandwidth per month, just a good reliable provider.

Webfaction++

m0nk3yz
Mar 13, 2002

Behold the power of cheese!
So I know the likelihood of any of you running Python 3 regularly is slim, but I thought I'd throw this out there: GSoC just wrapped up, and we now have an alpha version of a Python 3 to Python 2 conversion tool:

quote:

Hello all,

I have released the first alpha version of 3to2 after finishing it for my
Google Summer of Code 2009(tm) project. You can get the tarball for this
release at
http://bitbucket.org/amentajo/lib3to2/downloads/3to2_0.1-alpha1.tar.gz.
This requires python 2.7, because it requires a newer version of 2to3 than
what comes with 2.6.

Release notes are in the RELEASE file. Development happens at
http://bitbucket.org/amentajo/lib3to2/, and the source code for this release
lives at http://bitbucket.org/amentajo/3to2-01-alpha-1.
Report bugs at http://bitbucket.org/amentajo/lib3to2/issues/, please.

Additional notes and comments can (for now) be found at
http://www.startcodon.com/wordpress/?cat=4.

Please, if you have the time - give it a try.

duck monster
Dec 15, 2004

HatfulOfHollow posted:

And?

[root@ashprdmon04 nagios]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
[root@ashprdmon04 nagios]# rpm -q python
python-2.4.3-24.el5

Well that's poo poo too. It's a problem because it slows adoption of 2.6/3.0, which means python devs have to keep supporting legacy code bases, and it gives people a disincentive to modernise their code for the superior 2.6/3.0 branches.

Scaevolus
Apr 16, 2007

root beer posted:

I familiarized myself with XPaths using Hpricot and Firebug on RoR a while back and missed them while using BeautifulSoup; lxml looks like it'll cut a large chunk of work out of what I'm currently developing.

All the tips are much appreciated, thanks. :)

lxml can also do CSS selectors, which are often simpler to read than XPath (it translates them into XPath to perform the query).
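For example (a rough sketch; the file name and selector are made up):

code:
from lxml import html

doc = html.parse("page.html").getroot()
for link in doc.cssselect("div.content a"):
    print link.get("href")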

king_kilr
May 25, 2007
There's also pyquery: http://pypi.python.org/pypi/pyquery I don't know what the performance is like, but it seems cool to me :)
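The API is jQuery-style, something like this (untested, going from the docs):

code:
from pyquery import PyQuery

d = PyQuery(open("page.html").read())
print d("a").attr("href")   # href of the first matching <a>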

deedee megadoodoo
Sep 28, 2000
Two roads diverged in a wood, and I, I took the one to Flavortown, and that has made all the difference.


duck monster posted:

Well that's poo poo too. It's a problem because it slows adoption of 2.6/3.0, which means python devs have to keep supporting legacy code bases, and it gives people a disincentive to modernise their code for the superior 2.6/3.0 branches.

I completely agree. But I've just come to accept that Red Hat is going to be slow to adopt new versions of anything, even when they offer vast improvements. Hell, we're still building some boxes with RHEL4, and those REALLY suck...

-bash-3.00$ cat /etc/redhat-release
Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
-bash-3.00$ rpm -q python
python-2.3.4-14.2

m0nk3yz
Mar 13, 2002

Behold the power of cheese!

duck monster posted:

Well that's poo poo too. It's a problem because it slows adoption of 2.6/3.0, which means python devs have to keep supporting legacy code bases, and it gives people a disincentive to modernise their code for the superior 2.6/3.0 branches.

If I had poo poo-tons of money, I'd personally fly to the house of each OS maintainer and attempt to bribe them to upgrade to at least the latest 2.6. I just jumped to FC11 simply for the 2.6 support. Later today I'm rebuilding my laptop with Snow Leopard for 2.6. We hateses old installs.

EvoDevo
Jul 5, 2007
I am very new to Python and programming in general. I am joining a lab and one of the initial projects is to trim a file. We currently have the following before a DNA sequence:

>FBtr0071764 type=mRNA; loc=2R:join(18024938..18025756,18039159..18039200,18050410..18051199,18052282..18052494,18056749..18058222,18058283..18059490,18059587..18059757,18059821..18059938,18060002..18060346); ID=FBtr0071764; name=a-RB; dbxref=FlyBase_Annotation_IDs:CG6741-RB,FlyBase:FBtr0071764,REFSEQ:NM_079902; score=7; score_text=Moderately Supported; MD5=8ef39c97ee0a0abc31900fefe3fbce8f; length=5180; parent=FBgn0000008; release=r5.19; species=Dmel;

We would like it to read:

>CG6741-RB

We have a file with about 1500 of these, each one followed by about 2000 characters (AGCT) for DNA sequence. What would be the best way to remove all of the unnecessary data?

Lurchington
Jan 2, 2003

Forums Dragoon
a regular expression similar to:

re.findall(r"FlyBase_Annotation_IDs\:(.*?),", YOUR_BLOCK_OF_TEXT)

should be a start

code:
import re

print re.findall(r"FlyBase_Annotation_IDs\:(.*?),", YOUR_BLOCK_OF_TEXT)
It's not clear what, besides the FlyBase_Annotation_IDs, you want to keep, but you could also use split() or anything else on the re module page.

nonathlon
Jul 9, 2004
And yet, somehow, now it's my fault ...

EvoDevo posted:

I am very new to Python and programming in general. I am joining a lab and one of the initial projects is to trim a file. We currently have the following before a DNA sequence:

>FBtr0071764 type=mRNA; loc=2R:join(18024938..18025756,18039159..18039200,18050410..18051199,18052282..18052494,18056749..18058222,18058283..18059490,18059587..18059757,18059821..18059938,18060002..18060346); ID=FBtr0071764; name=a-RB; dbxref=FlyBase_Annotation_IDs:CG6741-RB,FlyBase:FBtr0071764,REFSEQ:NM_079902; score=7; score_text=Moderately Supported; MD5=8ef39c97ee0a0abc31900fefe3fbce8f; length=5180; parent=FBgn0000008; release=r5.19; species=Dmel;

We would like it to read:

>CG6741-RB

We have a file with about 1500 of these, each one followed by about 2000 characters (AGCT) for DNA sequence. What would be the best way to remove all of the unnecessary data?

Been there, done that. Use a regex to capture the header lines, something like r'^>.*;$' with re.M. Then take the text you captured and run a second regex over it, something like r'FlyBase_Annotation_IDs:([^,]*)', and replace the original text with the captured subpattern. You could combine the two regexes into one, but it might be easier to build and debug if you do it stage by stage.
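In rough, untested form (shorten() assumes the whole header fits on one line):

code:
import re

def shorten(header):
    # header is one '>...' line from the file
    m = re.search(r'FlyBase_Annotation_IDs:([^,]*)', header)
    return '>%s\n' % m.group(1) if m else header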

Scaevolus
Apr 16, 2007

EvoDevo posted:

I am very new to Python and programming in general. I am joining a lab and one of the initial projects is to trim a file. We currently have the following before a DNA sequence:

>FBtr0071764 type=mRNA; loc=2R:join(18024938..18025756,18039159..18039200,18050410..18051199,18052282..18052494,18056749..18058222,18058283..18059490,18059587..18059757,18059821..18059938,18060002..18060346); ID=FBtr0071764; name=a-RB; dbxref=FlyBase_Annotation_IDs:CG6741-RB,FlyBase:FBtr0071764,REFSEQ:NM_079902; score=7; score_text=Moderately Supported; MD5=8ef39c97ee0a0abc31900fefe3fbce8f; length=5180; parent=FBgn0000008; release=r5.19; species=Dmel;

We would like it to read:

>CG6741-RB

We have a file with about 1500 of these, each one followed by about 2000 characters (AGCT) for DNA sequence. What would be the best way to remove all of the unnecessary data?

A proper parser would be best, but you could probably get away with using regexes.

Something like this:
code:
import re

dna_file = open("dna.whatever", 'U')
output_file = open("dna.whatever.trimmed", "w")

line = dna_file.readline()
while line != '':
    if line.startswith('>'):
        # grab the header line plus the two sequence lines that follow it
        chunk = line + dna_file.readline() + dna_file.readline()
        # keep only the annotation ID from the dbxref field, then the sequence
        output_file.write(re.sub(r'.*dbxref=(?:[\w-]*:)([\w-]+).*\n([AGCT]+)',
                          r'>\1\n\2', chunk))
    line = dna_file.readline()
There are a lot of assumptions in that (a single-line header, the sequence on the lines right after it), but it should work if your DNA file is well formed.

Threep
Apr 1, 2006

It's kind of a long story.
I don't think I've ever seen "well formed" output from scientific software. I have a friend working on her Ph.D. in (I think) biochem, and she's spent so much time writing Perl and whatnot to deal with the output that she's now considering a career in programming.

Most scientific software is, well, Dwarf Fortress to put it succinctly.

nonathlon
Jul 9, 2004
And yet, somehow, now it's my fault ...

Threep posted:

I don't think I've ever seen "well formed" output from scientific software. I have a friend working on her Ph.D. in (I think) biochem, and she's spent so much time writing Perl and whatnot to deal with the output that she's now considering a career in programming.

Most scientific software is, well, Dwarf Fortress to put it succinctly.

Most scientific software is written by self-taught programmers. If you ever doubt the value of university CS degrees (and of being formally taught shibboleths like commenting, structured programming, and recursion), look at the source code of most bioinformatics software. I once went to hack on one prominent program, looking to make a simple change. "Huh, only six files. This should be easy." Each file was over 4000 lines long, uncommented, and full of variables like x, x2, x3, h, hh, foo, and bar. I eventually gave up, finding it easier to rewrite the entire program.

UnNethertrash
Jun 14, 2007
The Trashman Cometh
If you have a list that you know is sorted, is there a built-in way to search it efficiently?

For example, say I have a list of primes under 1 million in ascending order, and I want to check whether a number is prime.

I'm not very good at programming, but I know how to write a binary search. However, I think that a) there are probably even faster ways to search, and b) Python probably has something built in.

Thanks in advance.

king_kilr
May 25, 2007

UnNethertrash posted:

If you have a list that you know is sorted, is there a built-in way to search it efficiently?

For example, say I have a list of primes under 1 million in ascending order, and I want to check whether a number is prime.

I'm not very good at programming, but I know how to write a binary search. However, I think that a) there are probably even faster ways to search, and b) Python probably has something built in.

Thanks in advance.

Binary search is the best way. You can use the bisect module.
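Something like this (a quick sketch):

code:
import bisect

def contains(sorted_list, x):
    # binary search membership test, O(log n)
    i = bisect.bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x

print contains([2, 3, 5, 7, 11], 7)   # True
print contains([2, 3, 5, 7, 11], 9)   # False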

Tin Gang
Sep 27, 2007

Tin Gang posted:

showering has no effect on germs and is terrible for your skin. there is no good reason to do it
http://docs.python.org/tutorial/datastructures.html#more-on-lists

You could also use list.count(). It returns 0 if something is not in the list, and 1 or more if it is. I don't know if that's the best way, but it's certainly the least complicated.

Avenging Dentist
Oct 1, 2005

oh my god is that a circular saw that does not go in my mouth aaaaagh
Or just, you know, use a set.
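i.e. (primes standing in for your list):

code:
primes = [2, 3, 5, 7, 11]   # your sorted list of primes
prime_set = set(primes)     # build once
print 7 in prime_set        # membership tests are O(1) on average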


Janitor Prime
Jan 22, 2004

PC LOAD LETTER

What da fuck does that mean

Fun Shoe

Aardlof posted:

http://docs.python.org/tutorial/datastructures.html#more-on-lists

You could also use list.count(). It returns 0 if something is not in the list, and 1 or more if it is. I don't know if that's the best way, but it's certainly the least complicated.

This is definitely not the best way. Python has the bisect module, as already stated, which does a binary search on an ordered list.
