|
chemosh6969 posted:I've never done web scraping before and was wondering if anyone knew of, or had, an example of an IMDb scrape? I want to learn how to do it for a few other sites but really wanted to find an example to build off of. root beer posted:Take a look at BeautifulSoup. It has extensive documentation and would make the task loads easier.
|
# ? Aug 24, 2009 09:49 |
|
Say I've got a list of values, in this form: ["reference-2-000111-A", "reference-2-000111-B", "reference-2-000111C"] So there are three sections to each value: the word reference, which is common to all values like this; the number, which is common to all the values in this set; and the letter, which is different for each. How can I split a value in the list apart, so I can use the number elsewhere? Basically, how can I get the parts of a value on their own if it's in a list?
|
# ? Aug 24, 2009 11:10 |
|
TheMaskedUgly posted:Say I've got a list of values, in this form: Call .split("-") on the string you want separated; that will give you a list that looks like ["reference", "2", "000111", "A"].
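A quick sketch of that (hypothetical variable names; note the third value in the original list is missing its last hyphen, so check your data is consistent first):

```python
values = ["reference-2-000111-A", "reference-2-000111-B", "reference-2-000111-C"]

for value in values:
    parts = value.split("-")   # ["reference", "2", "000111", "A"]
    number = parts[2]          # the number shared by the set, e.g. "000111"
    letter = parts[3]          # the per-value letter, e.g. "A"
```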
|
# ? Aug 24, 2009 11:21 |
|
TheMaskedUgly posted:Say I've got a list of values, in this form: If every value starts with the word "reference", has a string of numbers, and then a single letter at the end, you can extract the number and letter with this... code:
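A regex along the lines being described (a sketch; the trailing -? makes the last hyphen optional, so it also matches the unhyphenated third value from the original list):

```python
import re

pattern = re.compile(r"^reference-(\d+)-(\d+)-?([A-Z])$")

# groups come back as strings: set number, shared number, per-value letter
set_number, number, letter = pattern.match("reference-2-000111-A").groups()
# set_number == "2", number == "000111", letter == "A"
```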
|
# ? Aug 24, 2009 13:39 |
|
Scaevolus posted:lxml is faster and better than BeautifulSoup. Having just cut over to lxml for a project, I can completely agree with this - lxml is pure awesome.
|
# ? Aug 24, 2009 14:06 |
|
m0nk3yz posted:Having just cut over to lxml for a project, I can completely agree with this - lxml is pure awesome. I have also heard good things about http://code.google.com/p/httplib2/. lxml has been excellent for scraping so far. Aside: urllib2 is awful. pycurl does what you want, but you'll inevitably write a wrapper for it. Like this one: http://github.com/tef/durb/blob/83a63a57088db0ca16f5959111462e41af9572d1/fetch.py
|
# ? Aug 24, 2009 14:56 |
|
Anyone have any experience with the csv module? I have a csv file that, after so many lines, switches headers. Like this: code:
|
# ? Aug 24, 2009 21:37 |
|
Scaevolus posted:lxml is faster and better than BeautifulSoup. I've been using BeautifulSoup for a number of projects and this looks very promising. Thanks!
|
# ? Aug 24, 2009 22:08 |
|
nbv4 posted:Anyone have any experience with the csv module? I have a csv file that, after so many lines, switches headers. Like this: How about reading through the whole file first and marking the header sections, since you know how to identify them. Then seek(0) and start reading until you reach your next header. At that point, recreate your DictReader and it might pick up the next header. Not sure if this works, but it's worth a shot. Alternatively, subclass "reader" and alter the next() method to inspect each new line to see if it's a header.
|
# ? Aug 24, 2009 22:11 |
|
root beer posted:You are my hero. Learn xpath, it is very good for short scraping code. A simple example: /html/head/title (guess what it does). Or even //a/@href - this returns the href attribute for all of the a nodes in the document.

A couple of warnings about lxml:

- The namespace support for xml is clunky, and not very convenient to use. You can get regular expression support in xpaths too if you add the right namespace to your query.
- It will mangle the HTML somewhat to ensure it's well-formed - i.e. it won't be the same as what appears in the browser (this includes character codes sometimes).
- When you're downloading the page, you might have to understand character encoding and python. BeautifulSoup hides this much better (it has a UnicodeDammit module). For example: if you try to decode the raw data with the character set in the header, and then try to use lxml to parse it, it can fail with a UnicodeExceptionError or something. (This only happens when the document itself specifies the encoding - I think only in the doctype, but I can't recall. It might happen more frequently with xml, I dunno.)
- And finally, in lxml // means /descendant-or-self::node()/ (i.e. anywhere from the root), not descendant-or-self relative to the current node.

p.s. also a surprising number of webservers are broken in weird ways. scraping can be a bit of a constant headache. character encodings will drive you insane after a while

tef fucked around with this message at 00:36 on Aug 25, 2009 |
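The two xpath examples, runnable against a small inline document (a sketch using lxml.html rather than a live page):

```python
from lxml import html

doc = html.fromstring(
    "<html><head><title>Example</title></head>"
    "<body><a href='/one'>one</a> <a href='/two'>two</a></body></html>"
)

titles = doc.xpath("/html/head/title/text()")  # the title text
hrefs = doc.xpath("//a/@href")                 # href of every a node
```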
# ? Aug 25, 2009 00:34 |
|
tef posted:awesomeness All the tips are much appreciated, thanks.
|
# ? Aug 25, 2009 01:46 |
|
tef posted:I have also head good things about http://code.google.com/p/httplib2/. lxml has been excellent for scraping so far. I *always* use pycurl - and yeah, I *always* write a wrapper for it, but I'm dipping my toes into httplib2 for some work I'm doing.
|
# ? Aug 25, 2009 01:50 |
|
nbv4 posted:Anyone have any experience with the csv module? I have a csv file that, after so many lines, switches headers. Like this: A good solution above, but out of curiosity are these "segmented" csv files common? I'm wondering how you can tell something is a new header except from context (i.e. these are letters and normal rows are numbers).
|
# ? Aug 25, 2009 13:31 |
|
outlier posted:A good solution above, but out of curiosity are these "segmented" csv files common? I'm wondering how you can tell something is a new header except from context (i.e. these are letters and normal rows are numbers). The csv files are created by another application of mine, and all header rows begin with "##", so I know when a row is a header. The first column is always a date, so if there happens to be a pound sign in the date then there are problems anyway.
|
# ? Aug 25, 2009 21:35 |
|
quote:"In the current state of affairs and given the plans of the Python maintainers, there will be likely no python2.6 (even as non-default) in squeeze." Yay debian!
|
# ? Aug 26, 2009 10:14 |
|
duck monster posted:Yay debian! I love it when people quote things without context.
|
# ? Aug 26, 2009 11:28 |
|
What's there to say? 2.6 won't be in the next debian.
|
# ? Aug 26, 2009 23:16 |
|
duck monster posted:Whats there to say? 2.6 won't be in the next debian. And?

[root@ashprdmon04 nagios]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
[root@ashprdmon04 nagios]# rpm -q python
python-2.4.3-24.el5
|
# ? Aug 27, 2009 12:24 |
|
Does anyone have recommendations for a good host with Python as a usable language? I'm planning on setting up a small-ish site and don't need that much bandwidth per month, just a good reliable provider.
|
# ? Aug 28, 2009 00:21 |
|
Ethereal posted:Does anyone have recommendations for a good host with Python as a usable language? I'm planning on setting up a small-ish site and don't need that much bandwidth per month, just a good reliable provider. I've had plenty of success with Dreamhost.
|
# ? Aug 28, 2009 00:38 |
|
Ethereal posted:Does anyone have recommendations for a good host with Python as a usable language? I'm planning on setting up a small-ish site and don't need that much bandwidth per month, just a good reliable provider. WebFaction is great when it comes to running Python projects. You can deploy pretty much however you want, and the admins are incredibly helpful.
|
# ? Aug 28, 2009 01:16 |
|
Ethereal posted:Does anyone have recommendations for a good host with Python as a usable language? I'm planning on setting up a small-ish site and don't need that much bandwidth per month, just a good reliable provider. without a doubt webfaction.
|
# ? Aug 28, 2009 01:29 |
|
Ethereal posted:Does anyone have recommendations for a good host with Python as a usable language? I'm planning on setting up a small-ish site and don't need that much bandwidth per month, just a good reliable provider. Webfaction++
|
# ? Aug 28, 2009 02:43 |
|
So I know the likelihood of any of you running Python 3 regularly is slim, but I thought I'd throw this out there for you - GSoC just completed, and we now have an alpha version of a Python 3 to Python 2 conversion tool: quote:Hello all, Please, if you have the time - give it a try.
|
# ? Aug 28, 2009 02:45 |
|
HatfulOfHollow posted:And? Well that's poo poo too. It's a problem because it slows adoption of 2.6/3.0, which means python devs have to keep supporting legacy code bases, and provides disincentives for people to modernise their code for the superior 2.6/3.0 branches.
|
# ? Aug 28, 2009 06:58 |
|
root beer posted:I familiarized myself with XPaths using Hpricot and Firebug on RoR a while back and missed them while using BeautifulSoup; lxml looks like it'll cut a large chunk of work out of what I'm currently developing. lxml can also do CSS selectors, which are often simpler to read than XPath (it translates them into XPath to perform the query).
|
# ? Aug 28, 2009 07:16 |
|
There's also pyquery: http://pypi.python.org/pypi/pyquery I don't know what the performance is like, but it seems cool to me
|
# ? Aug 28, 2009 07:24 |
|
duck monster posted:Well thats poo poo too. Its a problem, because it slows adoption of 2.6/3.0 and that means python devs have to keep supporting legacy code bases, and provides disincentives for people to modernise their code for the superior 2.6/3.0 branches. I completely agree. But I've just started accepting that Redhat is going to be slow to adopt new versions of anything, even if it offers vast improvement. Hell, we're still building some boxes with RHEL4 and those REALLY suck...

-bash-3.00$ cat /etc/redhat-release
Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
-bash-3.00$ rpm -q python
python-2.3.4-14.2
|
# ? Aug 28, 2009 13:55 |
|
duck monster posted:Well thats poo poo too. Its a problem, because it slows adoption of 2.6/3.0 and that means python devs have to keep supporting legacy code bases, and provides disincentives for people to modernise their code for the superior 2.6/3.0 branches. If I had poo poo-tons of money, I'd personally fly to the house of each OS maintainer and attempt to bribe them to at least upgrade to 2.6 latest. I just jumped to FC11 simply due to the 2.6 support. Later today I'm rebuilding my laptop with snow leopard for 2.6. We hateses old installs.
|
# ? Aug 28, 2009 14:04 |
|
I am very new to Python and programming in general. I am joining a lab and one of the initial projects is to trim a file. We currently have the following before a DNA sequence: >FBtr0071764 type=mRNA; loc=2R:join(18024938..18025756,18039159..18039200,18050410..18051199,18052282..18052494,18056749..18058222,18058283..18059490,18059587..18059757,18059821..18059938,18060002..18060346); ID=FBtr0071764; name=a-RB; dbxref=FlyBase_Annotation_IDs:CG6741-RB,FlyBase:FBtr0071764,REFSEQ:NM_079902; score=7; score_text=Moderately Supported; MD5=8ef39c97ee0a0abc31900fefe3fbce8f; length=5180; parent=FBgn0000008; release=r5.19; species=Dmel; We would like it to read: >CG6741-RB We have a file with about 1500 of these, each one followed by about 2000 characters (AGCT) for DNA sequence. What would be the best way to remove all of the unnecessary data?
|
# ? Aug 28, 2009 18:16 |
|
a regular expression similar to: re.findall(r"FlyBase_Annotation_IDs:(.*?),", YOUR_BLOCK_OF_TEXT) should be a start code:
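Applied to a shortened version of the header line above:

```python
import re

header = (">FBtr0071764 type=mRNA; "
          "dbxref=FlyBase_Annotation_IDs:CG6741-RB,FlyBase:FBtr0071764; "
          "length=5180;")

# lazily capture everything between the field name and the next comma
ids = re.findall(r"FlyBase_Annotation_IDs:(.*?),", header)
# ids == ['CG6741-RB']
```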
|
# ? Aug 28, 2009 18:24 |
|
EvoDevo posted:I am very new to Python and programming in general. I am joining a lab and one of the initial projects is to trim a file. We currently have the following before a DNA sequence: Been there, done that. Use a regex to capture the header lines, something like r'^>.*;$'. Then take the text that you captured and use a regex on it, something like r'FlyBase_Annotation_IDs:([^,]*)' and replace the original text with the captured subpattern. You could combine the two regexes into one, but it might be easier to build / debug if you do it stage by stage.
|
# ? Aug 28, 2009 18:27 |
|
EvoDevo posted:I am very new to Python and programming in general. I am joining a lab and one of the initial projects is to trim a file. We currently have the following before a DNA sequence: A proper parser would be best, but you could probably get away with using regexes. Something like this: code:
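A regex-based sketch in that spirit (assumes every header line starts with ">" and carries a FlyBase_Annotation_IDs entry; lines without one are left alone):

```python
import re

ID_PATTERN = re.compile(r"FlyBase_Annotation_IDs:([^,;]+)")

def trim_headers(lines):
    """Replace each FASTA-style header with just the annotation ID."""
    for line in lines:
        if line.startswith(">"):
            match = ID_PATTERN.search(line)
            if match:
                line = ">" + match.group(1)
        yield line

# usage sketch (hypothetical filenames):
# with open("flybase.fasta") as src, open("trimmed.fasta", "w") as dst:
#     for line in trim_headers(raw.rstrip("\n") for raw in src):
#         dst.write(line + "\n")
```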
|
# ? Aug 28, 2009 18:42 |
|
I don't think I've ever seen "well formed" output from scientific software. I have a friend working on her Ph.D. in I think it's biochem and she's spent so much time writing Perl and whatnot to deal with the output that she's considering a career in programming now. Most scientific software is, well, Dwarf Fortress to put it succinctly.
|
# ? Aug 28, 2009 18:52 |
|
Threep posted:I don't think I've ever seen "well formed" output from scientific software. I have a friend working on her Ph.D. in I think it's biochem and she's spent so much time writing Perl and whatnot to deal with the output that she's considering a career in programming now. Most scientific software is written by people who are self-taught programmers. If you ever doubt the value of university CS degrees (and being formally taught shibboleths like commenting, structured programming and recursion), look at the source code of most bioinformatics software. I once went to hack on one prominent program, looking to make a simple change. "Huh, only six files. This should be easy." Each file was over 4000 lines long, uncommented, and full of variables like x, x2, x3, h, hh, foo and bar. I eventually gave up, finding it easier to rewrite the entire program.
|
# ? Aug 29, 2009 09:25 |
|
If you have a list that you know is sorted, is there a built in way to efficiently search the list? For example, say I have a list of primes under 1 million in ascending order, and I want to check a number to see if it's prime. I'm not very good at programming, but I know how to write a binary search. However, I think that a) there are probably even faster ways to search, b) python would have something built in. Thanks in advance.
|
# ? Aug 29, 2009 18:54 |
|
UnNethertrash posted:If you have a list that you know is sorted, is there a built in way to efficiently search the list? Binary search is the best way. You can use the bisect module.
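A minimal sketch of a membership test built on bisect:

```python
import bisect

def contains(sorted_list, value):
    """True if value is in sorted_list; O(log n) via binary search."""
    i = bisect.bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value

primes = [2, 3, 5, 7, 11, 13]
contains(primes, 7)   # True
contains(primes, 9)   # False
```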
|
# ? Aug 29, 2009 19:10 |
|
http://docs.python.org/tutorial/datastructures.html#more-on-lists you could also do list.count(). It returns 0 if something is not in a list, and 1 or more if it is. I don't know if that's the best way but it's certainly the least complicated.
|
# ? Aug 29, 2009 19:56 |
|
Or just, you know, use a set.
|
# ? Aug 29, 2009 20:08 |
|
Aardlof posted:http://docs.python.org/tutorial/datastructures.html#more-on-lists This is definitely not the best way. Python has the bisect module as already stated which does a binary search on an ordered list.
|
# ? Aug 29, 2009 20:46 |