Scaevolus
Apr 16, 2007

chemosh6969 posted:

I've never done web scraping before and was wondering if anyone knew of, or had, an example of an IMDb scrape? I want to learn how to do it for a few other sites, but I really wanted to find an example to build off of.
Parsing their official data files would be more reliable: http://www.imdb.com/interfaces

root beer posted:

Take a look at BeautifulSoup. It has extensive documentation and would make the task loads easier.

This guy has a pretty good guide for how to use it.
lxml is faster and better than BeautifulSoup.


TheMaskedUgly
Sep 21, 2008

Let's play a different game.
Say I've got a list of values, in this form:
["reference-2-000111-A", "reference-2-000111-B", "reference-2-000111C"]
So there's three sections to each list, the word reference which is common to all values like this, the number which is common to all the values in this set, and the letter which is difference for each.
How can I separate a value in the list, so I can use the number elsewhere. How can I get the two parts of the number on their own if its in a list, basically.

Ragg
Apr 27, 2003

<The Honorable Badgers>

TheMaskedUgly posted:

Say I've got a list of values, in this form:
["reference-2-000111-A", "reference-2-000111-B", "reference-2-000111C"]
So there's three sections to each list, the word reference which is common to all values like this, the number which is common to all the values in this set, and the letter which is difference for each.
How can I separate a value in the list, so I can use the number elsewhere. How can I get the two parts of the number on their own if its in a list, basically.

Call .split("-") on the string you want separated, that will give you a list that looks like ["reference", "2", "000111", "A"].

deedee megadoodoo
Sep 28, 2000
Two roads diverged in a wood, and I, I took the one to Flavortown, and that has made all the difference.


TheMaskedUgly posted:

Say I've got a list of values, in this form:
["reference-2-000111-A", "reference-2-000111-B", "reference-2-000111C"]
So there's three sections to each list, the word reference which is common to all values like this, the number which is common to all the values in this set, and the letter which is difference for each.
How can I separate a value in the list, so I can use the number elsewhere. How can I get the two parts of the number on their own if its in a list, basically.

If every value starts with the word "reference", then has a string of numbers and a single letter at the end, you can extract the number and letter with this...

code:
>>>data = ["reference-2-000111-A", "reference-2-000111-B", "reference-2-000111-C"]
>>>[(i[10:-2],i[-1]) for i in data]
[('2-000111', 'A'), ('2-000111', 'B'), ('2-000111', 'C')]

m0nk3yz
Mar 13, 2002

Behold the power of cheese!

Scaevolus posted:

lxml is faster and better than BeautifulSoup.

Having just cut over to lxml for a project, I can completely agree with this - lxml is pure awesome.

tef
May 30, 2004

-> some l-system crap ->

m0nk3yz posted:

Having just cut over to lxml for a project, I can completely agree with this - lxml is pure awesome.

I have also heard good things about http://code.google.com/p/httplib2/. lxml has been excellent for scraping so far.

Aside: urllib2 is awful. pycurl does what you want but you'll inevitably write a wrapper for it.

Like this one: http://github.com/tef/durb/blob/83a63a57088db0ca16f5959111462e41af9572d1/fetch.py
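The core of such a wrapper is only a few lines; roughly (an untested sketch):

code:
import pycurl
from StringIO import StringIO

def fetch(url):
    buf = StringIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.FOLLOWLOCATION, 1)   # follow redirects
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.perform()
    c.close()
    return buf.getvalue()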

nbv4
Aug 21, 2002

by Duchess Gummybuns
Anyone have any experience with the csv module? I have a csv file that, after so many lines, switches headers. Like this:

code:
head1,head2,head3
12,172,34
453,568,235
342,834,2234
...
head4,head5,head6,head7
564,3576,3546,322
356,332,556,789
3435,435,76,980
I have a DictReader that is defined with the first header line, but I can't figure out how to set the new header on the same DictReader. I know how to detect when I've reached a new header row during iteration; I just can't figure out how to give the DictReader the new list of headers.

DICTATOR OF FUNK
Nov 6, 2007

aaaaaw yeeeeeah

Scaevolus posted:

lxml is faster and better than BeautifulSoup.
You are my hero.

I've been using BeautifulSoup for a number of projects and this looks very promising. Thanks!

deedee megadoodoo
Sep 28, 2000
Two roads diverged in a wood, and I, I took the one to Flavortown, and that has made all the difference.


nbv4 posted:

Anyone have any experience with the csv module? I have a csv file that, after so many lines, switches headers. Like this:

code:
head1,head2,head3
12,172,34
453,568,235
342,834,2234
...
head4,head5,head6,head7
564,3576,3546,322
356,332,556,789
3435,435,76,980
I have a DictReader that is defined with the first header line, but I can't figure out how to set the new header on the same DictReader. I know how to detect when I've reached a new header row during iteration; I just can't figure out how to give the DictReader the new list of headers.

How about reading through the whole file first and marking the header sections, since you know how to identify them? Then seek(0) and start reading until you reach the next header. At that point, recreate your DictReader and it should pick up the new header. Not sure if this works, but it's worth a shot.

Alternatively, subclass "reader" and alter the next() method to inspect each new line to see if it's a header.
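A third option: skip DictReader entirely and zip the rows up yourself. A rough, untested sketch (is_header() is a stand-in for whatever test you already have):

code:
import csv

def read_segmented(csvfile, is_header):
    # yield each data row as a dict, re-keying whenever a header row appears
    fieldnames = None
    for row in csv.reader(csvfile):
        if not row:
            continue
        if is_header(row):
            fieldnames = row          # switch to the new header
        elif fieldnames:
            yield dict(zip(fieldnames, row))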

tef
May 30, 2004

-> some l-system crap ->

root beer posted:

You are my hero.

I've been using BeautifulSoup for a number of projects and this looks very promising. Thanks!

Learn XPath; it is very good for short scraping code.

A simple example: /html/head/title (guess what it does)

Or even //a/@href - this returns the href attribute of every a node in the document.
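In lxml that's something like this (a quick sketch; the URL is a stand-in):

code:
from lxml import html

doc = html.parse("http://example.com/")
print doc.xpath("/html/head/title/text()")   # the page title
print doc.xpath("//a/@href")                 # every link target in the page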

A couple of warnings about lxml:

The namespace support for xml is clunky, and not very convenient to use.

You can get regular expression support in xpaths too if you add the right namespace to your query.

It will mangle the HTML somewhat to ensure it's well-formed, i.e. it won't be the same as what appears in the browser (this sometimes includes character codes).

When you're downloading pages, you might have to understand character encodings in Python. BeautifulSoup hides this much better (it has a UnicodeDammit class for guessing encodings).

For example: if you decode the raw bytes using the character set from the HTTP header and then hand the resulting unicode string to lxml to parse, it can fail with a ValueError (lxml refuses unicode strings that carry their own encoding declaration). (I think this only happens when the document itself specifies an encoding, probably in the XML declaration, but I can't recall. It might happen more frequently with xml, I dunno.)

And finally, in lxml a path starting with // means descendant-or-self from the document root (/), not from the context node (.).
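e.g. (a tiny demo of the difference):

code:
from lxml import html

page = "<html><body><div id='d'><a href='/in'>in</a></div><a href='/out'>out</a></body></html>"
div = html.fromstring(page).get_element_by_id('d')
print div.xpath("//a/@href")    # ['/in', '/out'] - // searches the whole document
print div.xpath(".//a/@href")   # ['/in'] - .// searches only this subtree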


p.s. also, a surprising number of webservers are broken in weird ways. scraping can be a constant headache. character encodings will drive you insane after a while :v:

tef fucked around with this message at 00:36 on Aug 25, 2009

DICTATOR OF FUNK
Nov 6, 2007

aaaaaw yeeeeeah

tef posted:

awesomeness
I familiarized myself with XPaths using Hpricot and Firebug on RoR a while back and missed them while using BeautifulSoup; lxml looks like it'll cut a large chunk of work out of what I'm currently developing.

All the tips are much appreciated, thanks. :)

m0nk3yz
Mar 13, 2002

Behold the power of cheese!

tef posted:

I have also heard good things about http://code.google.com/p/httplib2/. lxml has been excellent for scraping so far.

Aside: urllib2 is awful. pycurl does what you want but you'll inevitably write a wrapper for it.

Like this one: http://github.com/tef/durb/blob/83a63a57088db0ca16f5959111462e41af9572d1/fetch.py

I *always* use pycurl - and yeah, I *always* write a wrapper for it, but I'm dipping my toes into httplib2 for some work I'm doing.

nonathlon
Jul 9, 2004
And yet, somehow, now it's my fault ...

nbv4 posted:

Anyone have any experience with the csv module? I have a csv file that, after so many lines, switches headers. Like this:

A good solution above, but out of curiosity, are these "segmented" csv files common? I'm wondering how you can tell a row is a new header except from context (i.e. its cells are letters while normal rows are numbers).

nbv4
Aug 21, 2002

by Duchess Gummybuns

outlier posted:

A good solution above, but out of curiosity are these "segmented" csv files common? I'm wondering how you can tell something is a new header except from context (i.e. these are letters and normal rows are numbers).

The csv files are created by another application of mine, and all header rows begin with "##", so I know when a row is a header. The first column is always a date, so if there happens to be a pound sign in the date then there are problems anyway.

duck monster
Dec 15, 2004

quote:

"In the current state of affairs and given the plans of the Python maintainers, there will be likely no python2.6 (even as non-default) in squeeze."

Yay debian!

deedee megadoodoo
Sep 28, 2000
Two roads diverged in a wood, and I, I took the one to Flavortown, and that has made all the difference.


duck monster posted:

Yay debian!

I love it when people quote things without context.

duck monster
Dec 15, 2004

What's there to say? 2.6 won't be in the next Debian.

deedee megadoodoo
Sep 28, 2000
Two roads diverged in a wood, and I, I took the one to Flavortown, and that has made all the difference.


duck monster posted:

Whats there to say? 2.6 won't be in the next debian.

And?

[root@ashprdmon04 nagios]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
[root@ashprdmon04 nagios]# rpm -q python
python-2.4.3-24.el5

Ethereal
Mar 8, 2003
Does anyone have recommendations for a good host with Python as a usable language? I'm planning on setting up a small-ish site and don't need that much bandwidth per month, just a good reliable provider.

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Ethereal posted:

Does anyone have recommendations for a good host with Python as a usable language? I'm planning on setting up a small-ish site and don't need that much bandwidth per month, just a good reliable provider.

I've had plenty of success with Dreamhost.

Xenos
Jun 17, 2005

Ethereal posted:

Does anyone have recommendations for a good host with Python as a usable language? I'm planning on setting up a small-ish site and don't need that much bandwidth per month, just a good reliable provider.

WebFaction is great when it comes to running Python projects. You can deploy pretty much however you want, and the admins are incredibly helpful.

king_kilr
May 25, 2007

Ethereal posted:

Does anyone have recommendations for a good host with Python as a usable language? I'm planning on setting up a small-ish site and don't need that much bandwidth per month, just a good reliable provider.

Without a doubt, WebFaction.

m0nk3yz
Mar 13, 2002

Behold the power of cheese!

Ethereal posted:

Does anyone have recommendations for a good host with Python as a usable language? I'm planning on setting up a small-ish site and don't need that much bandwidth per month, just a good reliable provider.

Webfaction++

m0nk3yz
Mar 13, 2002

Behold the power of cheese!
So I know the likelihood of any of you running Python 3 regularly is slim, but I thought I'd throw this out there: GSoC just wrapped up, and we now have an alpha version of a Python 3 to Python 2 conversion tool:

quote:

Hello all,

I have released the first alpha version of 3to2 after finishing it for my
Google Summer of Code 2009(tm) project. You can get the tarball for this
release at
http://bitbucket.org/amentajo/lib3to2/downloads/3to2_0.1-alpha1.tar.gz.
This requires python 2.7, because it requires a newer version of 2to3 than
what comes with 2.6.

Release notes are in the RELEASE file. Development happens at
http://bitbucket.org/amentajo/lib3to2/, and the source code for this release
lives at http://bitbucket.org/amentajo/3to2-01-alpha-1.
Report bugs at http://bitbucket.org/amentajo/lib3to2/issues/, please.

Additional notes and comments can (for now) be found at
http://www.startcodon.com/wordpress/?cat=4.

Please, if you have the time - give it a try.

duck monster
Dec 15, 2004

HatfulOfHollow posted:

And?

[root@ashprdmon04 nagios]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
[root@ashprdmon04 nagios]# rpm -q python
python-2.4.3-24.el5

Well that's poo poo too. It's a problem because it slows adoption of 2.6/3.0, which means python devs have to keep supporting legacy code bases, and it gives people a disincentive to modernise their code for the superior 2.6/3.0 branches.

Scaevolus
Apr 16, 2007

root beer posted:

I familiarized myself with XPaths using Hpricot and Firebug on RoR a while back and missed them while using BeautifulSoup; lxml looks like it'll cut a large chunk of work out of what I'm currently developing.

All the tips are much appreciated, thanks. :)

lxml can also do CSS selectors, which are often simpler to read than XPath (it translates them into XPath to perform the query).
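For example (a rough sketch; the file name and selector are made up):

code:
from lxml import html

doc = html.parse("page.html").getroot()
for link in doc.cssselect("div.content a"):
    print link.get("href")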

king_kilr
May 25, 2007
There's also pyquery: http://pypi.python.org/pypi/pyquery I don't know what the performance is like, but it seems cool to me :)
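The API is jQuery-style, something like this (untested, going from the docs):

code:
from pyquery import PyQuery

d = PyQuery(open("page.html").read())
print d("a").attr("href")   # href of the first matching <a>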

deedee megadoodoo
Sep 28, 2000
Two roads diverged in a wood, and I, I took the one to Flavortown, and that has made all the difference.


duck monster posted:

Well that's poo poo too. It's a problem because it slows adoption of 2.6/3.0, which means python devs have to keep supporting legacy code bases, and it gives people a disincentive to modernise their code for the superior 2.6/3.0 branches.

I completely agree. But I've just come to accept that Red Hat is going to be slow to adopt new versions of anything, even when they offer vast improvements. Hell, we're still building some boxes with RHEL4, and those REALLY suck...

-bash-3.00$ cat /etc/redhat-release
Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
-bash-3.00$ rpm -q python
python-2.3.4-14.2

m0nk3yz
Mar 13, 2002

Behold the power of cheese!

duck monster posted:

Well that's poo poo too. It's a problem because it slows adoption of 2.6/3.0, which means python devs have to keep supporting legacy code bases, and it gives people a disincentive to modernise their code for the superior 2.6/3.0 branches.

If I had poo poo-tons of money, I'd personally fly to the house of each OS maintainer and attempt to bribe them to upgrade to at least the latest 2.6. I just jumped to FC11 simply for the 2.6 support. Later today I'm rebuilding my laptop with Snow Leopard for 2.6. We hateses old installs.

EvoDevo
Jul 5, 2007
I am very new to Python and programming in general. I am joining a lab and one of the initial projects is to trim a file. We currently have the following before a DNA sequence:

>FBtr0071764 type=mRNA; loc=2R:join(18024938..18025756,18039159..18039200,18050410..18051199,18052282..18052494,18056749..18058222,18058283..18059490,18059587..18059757,18059821..18059938,18060002..18060346); ID=FBtr0071764; name=a-RB; dbxref=FlyBase_Annotation_IDs:CG6741-RB,FlyBase:FBtr0071764,REFSEQ:NM_079902; score=7; score_text=Moderately Supported; MD5=8ef39c97ee0a0abc31900fefe3fbce8f; length=5180; parent=FBgn0000008; release=r5.19; species=Dmel;

We would like it to read:

>CG6741-RB

We have a file with about 1500 of these, each one followed by about 2000 characters (AGCT) for DNA sequence. What would be the best way to remove all of the unnecessary data?

Lurchington
Jan 2, 2003

Forums Dragoon
a regular expression similar to:

re.findall(r"FlyBase_Annotation_IDs\:(.*?),", YOUR_BLOCK_OF_TEXT)

should be a start

code:
import re

print re.findall(r"FlyBase_Annotation_IDs\:(.*?),", YOUR_BLOCK_OF_TEXT)
It's not clear what, besides the FlyBase_Annotation_IDs, you want to keep, but you could also use split() or anything else on the re module page.

nonathlon
Jul 9, 2004
And yet, somehow, now it's my fault ...

EvoDevo posted:

I am very new to Python and programming in general. I am joining a lab and one of the initial projects is to trim a file. We currently have the following before a DNA sequence:

>FBtr0071764 type=mRNA; loc=2R:join(18024938..18025756,18039159..18039200,18050410..18051199,18052282..18052494,18056749..18058222,18058283..18059490,18059587..18059757,18059821..18059938,18060002..18060346); ID=FBtr0071764; name=a-RB; dbxref=FlyBase_Annotation_IDs:CG6741-RB,FlyBase:FBtr0071764,REFSEQ:NM_079902; score=7; score_text=Moderately Supported; MD5=8ef39c97ee0a0abc31900fefe3fbce8f; length=5180; parent=FBgn0000008; release=r5.19; species=Dmel;

We would like it to read:

>CG6741-RB

We have a file with about 1500 of these, each one followed by about 2000 characters (AGCT) for DNA sequence. What would be the best way to remove all of the unnecessary data?

Been there, done that. Use a regex to capture the header lines, something like r'^>.*;$' with re.M. Then take the text you captured and run a second regex over it, something like r'FlyBase_Annotation_IDs:([^,]*)', and replace the original text with the captured subpattern. You could combine the two regexes into one, but it might be easier to build and debug if you do it stage by stage.
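In rough, untested form (shorten() assumes the whole header fits on one line):

code:
import re

def shorten(header):
    # header is one '>...' line from the file
    m = re.search(r'FlyBase_Annotation_IDs:([^,]*)', header)
    return '>%s\n' % m.group(1) if m else header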

Scaevolus
Apr 16, 2007

EvoDevo posted:

I am very new to Python and programming in general. I am joining a lab and one of the initial projects is to trim a file. We currently have the following before a DNA sequence:

>FBtr0071764 type=mRNA; loc=2R:join(18024938..18025756,18039159..18039200,18050410..18051199,18052282..18052494,18056749..18058222,18058283..18059490,18059587..18059757,18059821..18059938,18060002..18060346); ID=FBtr0071764; name=a-RB; dbxref=FlyBase_Annotation_IDs:CG6741-RB,FlyBase:FBtr0071764,REFSEQ:NM_079902; score=7; score_text=Moderately Supported; MD5=8ef39c97ee0a0abc31900fefe3fbce8f; length=5180; parent=FBgn0000008; release=r5.19; species=Dmel;

We would like it to read:

>CG6741-RB

We have a file with about 1500 of these, each one followed by about 2000 characters (AGCT) for DNA sequence. What would be the best way to remove all of the unnecessary data?

A proper parser would be best, but you could probably get away with using regexes.

Something like this:
code:
import re

dna_file = open("dna.whatever", 'U')
output_file = open("dna.whatever.trimmed", "w")

line = dna_file.readline()
while line != '':
    if line.startswith('>'):
        # grab the header line plus the two sequence lines that follow it
        chunk = line + dna_file.readline() + dna_file.readline()
        # keep only the annotation ID from the dbxref field, then the sequence
        output_file.write(re.sub(r'.*dbxref=(?:[\w-]*:)([\w-]+).*\n([AGCT]+)',
                          r'>\1\n\2', chunk))
    line = dna_file.readline()
There are a lot of assumptions in that (a single-line header, the sequence on the lines right after it), but it should work if your DNA file is well formed.

Threep
Apr 1, 2006

It's kind of a long story.
I don't think I've ever seen "well formed" output from scientific software. I have a friend working on her Ph.D. in (I think) biochem, and she's spent so much time writing Perl and whatnot to deal with the output that she's now considering a career in programming.

Most scientific software is, well, Dwarf Fortress to put it succinctly.

nonathlon
Jul 9, 2004
And yet, somehow, now it's my fault ...

Threep posted:

I don't think I've ever seen "well formed" output from scientific software. I have a friend working on her Ph.D. in (I think) biochem, and she's spent so much time writing Perl and whatnot to deal with the output that she's now considering a career in programming.

Most scientific software is, well, Dwarf Fortress to put it succinctly.

Most scientific software is written by self-taught programmers. If you ever doubt the value of university CS degrees (and of being formally taught shibboleths like commenting, structured programming, and recursion), look at the source code of most bioinformatics software. I once went to hack on one prominent program, looking to make a simple change. "Huh, only six files. This should be easy." Each file was over 4000 lines long, uncommented, and full of variables like x, x2, x3, h, hh, foo, and bar. I eventually gave up, finding it easier to rewrite the entire program.

UnNethertrash
Jun 14, 2007
The Trashman Cometh
If you have a list that you know is sorted, is there a built-in way to search it efficiently?

For example, say I have a list of primes under 1 million in ascending order, and I want to check whether a number is prime.

I'm not very good at programming, but I know how to write a binary search. However, I think that a) there are probably even faster ways to search, and b) Python probably has something built in.

Thanks in advance.

king_kilr
May 25, 2007

UnNethertrash posted:

If you have a list that you know is sorted, is there a built-in way to search it efficiently?

For example, say I have a list of primes under 1 million in ascending order, and I want to check whether a number is prime.

I'm not very good at programming, but I know how to write a binary search. However, I think that a) there are probably even faster ways to search, and b) Python probably has something built in.

Thanks in advance.

Binary search is the best way. You can use the bisect module.
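Something like this (a quick sketch):

code:
import bisect

def contains(sorted_list, x):
    # binary search membership test, O(log n)
    i = bisect.bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x

print contains([2, 3, 5, 7, 11], 7)   # True
print contains([2, 3, 5, 7, 11], 9)   # False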

Tin Gang
Sep 27, 2007

Tin Gang posted:

showering has no effect on germs and is terrible for your skin. there is no good reason to do it
http://docs.python.org/tutorial/datastructures.html#more-on-lists

You could also use list.count(). It returns 0 if something is not in the list, and 1 or more if it is. I don't know if that's the best way, but it's certainly the least complicated.

Avenging Dentist
Oct 1, 2005

oh my god is that a circular saw that does not go in my mouth aaaaagh
Or just, you know, use a set.
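i.e. (primes standing in for your list):

code:
primes = [2, 3, 5, 7, 11]   # your sorted list of primes
prime_set = set(primes)     # build once
print 7 in prime_set        # membership tests are O(1) on average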


Janitor Prime
Jan 22, 2004

PC LOAD LETTER

What da fuck does that mean

Fun Shoe

Aardlof posted:

http://docs.python.org/tutorial/datastructures.html#more-on-lists

You could also use list.count(). It returns 0 if something is not in the list, and 1 or more if it is. I don't know if that's the best way, but it's certainly the least complicated.

This is definitely not the best way. Python has the bisect module, as already stated, which does a binary search on an ordered list.
