FingersMaloy
Dec 23, 2004

Fuck! That's Delicious.
Hi goons. I am trying to teach myself to build a web scraper for Craigslist using Scrapy and BeautifulSoup. I've never programmed before, but I've worked through Python Crash Course, and now I'm using Web Scraping With Python.

I have this program that pulls exactly what I want into a CSV if I manually feed it URLs.
Python code:
# Python 2 (urllib2); in Python 3 this lives in urllib.request
from urllib2 import urlopen
from bs4 import BeautifulSoup
import csv

html = urlopen("I type the urls here")
soup = BeautifulSoup(html.read(), "html.parser")

# Pull the fields I want out of the ad page
ad_title = soup.title
for date_and_time in soup.findAll(class_="timeago"):
    date_posted = date_and_time.get("datetime")
body = soup.find(id="postingbody")
mapaddress = soup.find(class_="mapaddress")
for apt in soup.findAll(id="map"):
    lat = apt.get("data-latitude")
    lon = apt.get("data-longitude")

# Append one row to the spreadsheet; note that "html" here is the
# response object returned by urlopen, not the URL string
csvFile = open("test.csv", 'a')
try:
    writer = csv.writer(csvFile)
    writer.writerow((html, ad_title, date_posted, body, mapaddress, lat, lon))
finally:
    csvFile.close()
This works.

The problem I'm having is that I can't get my own crawler to work. I can get Scrapy's tutorial crawler to run, so I know everything is installed correctly. What I need, or imagine I need, is a crawler that populates that "html" variable (or reconfigures the "soup" variable) so the rest of my code can finish the job. Again, I'm working from Scrapy's site:
Python code:
import scrapy
from bs4 import BeautifulSoup
import csv

class CraigslistSpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["cleveland.craigslist.org"]
    start_urls = ["two really long Craigslist search queries"]

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
Can anyone help me connect these two things?

FingersMaloy fucked around with this message at 21:05 on Mar 23, 2017


FingersMaloy
Dec 23, 2004

Fuck! That's Delicious.
quote not edit

FingersMaloy
Dec 23, 2004

Fuck! That's Delicious.
Yeah so I imagined this would work:

Python code:
import scrapy
from bs4 import BeautifulSoup
import csv

class CraigslistSpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["cleveland.craigslist.org"]
    start_urls = ["https://cleveland.craigslist.org/search/apa"]

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        ad_title = soup.title
        for date_and_time in soup.findAll(class_="timeago"):
            date_posted = date_and_time.get("datetime")
        body = soup.find(id="postingbody")
        mapaddress = soup.find(class_="mapaddress")
        for apt in soup.findAll(id="map"):
            lat = apt.get("data-latitude")
            lon = apt.get("data-longitude")
        csvFile = open("test.csv", 'a')
        try:
            writer = csv.writer(csvFile)
            writer.writerow((ad_title, date_posted, body, mapaddress, lat, lon))
        finally:
            csvFile.close()
It's throwing up a syntax error on line 24 ("finally:") right now that I can't resolve, but is it right to have the CSV pieces in the parse method?

Also, I know I'm going to need to write some exceptions into that method later, once I know this will work.

It ran without error, but didn't write to the CSV file! Getting close.
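
A minimal sketch of the more idiomatic route, assuming the same fields as above: instead of opening the file inside parse, yield a plain dict and let Scrapy's feed export write the CSV when the spider is run with "scrapy crawl craig -o test.csv":

Python code:
import scrapy
from bs4 import BeautifulSoup

class CraigslistSpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["cleveland.craigslist.org"]
    start_urls = ["https://cleveland.craigslist.org/search/apa"]

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        # Default the fields so a page missing an element can't raise NameError
        date_posted = lat = lon = None
        for date_and_time in soup.findAll(class_="timeago"):
            date_posted = date_and_time.get("datetime")
        for apt in soup.findAll(id="map"):
            lat = apt.get("data-latitude")
            lon = apt.get("data-longitude")
        # Feed export turns each yielded dict into one CSV row
        yield {
            "title": soup.title.string if soup.title else None,
            "date_posted": date_posted,
            "lat": lat,
            "lon": lon,
        }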

FingersMaloy fucked around with this message at 21:48 on Mar 23, 2017

FingersMaloy
Dec 23, 2004

Fuck! That's Delicious.
Thanks everyone! This is very helpful. I've reworked the CSV part to:

Python code:
with open("test.csv", 'a') as file:
    writer = csv.writer(file)
    writer.writerow((ad_title, date_posted, body, mapaddress, lat, lon))
That's a lot tidier.
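
Two small nits on that, for whatever they're worth: "file" shadows a Python built-in name, and since the first script imports urllib2 (so this is Python 2), the csv docs say the file should be opened in binary mode, otherwise blank rows can sneak in on Windows:

Python code:
# Python 2: append in binary mode ('ab') per the csv module docs
with open("test.csv", 'ab') as f:
    writer = csv.writer(f)
    writer.writerow((ad_title, date_posted, body, mapaddress, lat, lon))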

baka kaba posted:

You've got a few for loops in there where you repeatedly assign the same variables, so they'll end up with whatever the last value was. If you're trying to write multiple records to the CSV file, you need to write one inside the loop each time around (or store them in say a list and then write the whole lot at the end)

I think you mean these two sections:
Python code:
for date_and_time in soup.findAll(class_="timeago"):
    date_posted = date_and_time.get("datetime")

for apt in soup.findAll(id="map"):
    lat = apt.get("data-latitude")
    lon = apt.get("data-longitude")
Without going back and finding the full bit of HTML: I'm trying to pull the "datetime", "data-longitude", and "data-latitude" attributes, but I couldn't figure out how to get them without pulling the whole tag (which has several attributes) and then breaking it out.

I'm going to work on making the whole thing create a list and then write the list to the CSV, if I can figure that out. Thanks again all!
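
A sketch of that list-then-write idea against one of the loops above (this drops into parse, where "soup" already exists; the same pattern works for the map tags):

Python code:
rows = []
for date_and_time in soup.findAll(class_="timeago"):
    # one row per "timeago" element instead of overwriting a single variable
    rows.append([date_and_time.get("datetime")])

with open("test.csv", 'a') as f:
    writer = csv.writer(f)
    writer.writerows(rows)  # writes the whole batch in one call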

FingersMaloy
Dec 23, 2004

Fuck! That's Delicious.
I'm still trying to make this scraper work. I've abandoned BeautifulSoup and totally committed to Scrapy. My spider works, but I can't make it pull the exact pieces I need. I'm using this code as my guide, but it's not fully working:

https://github.com/jayfeng1/Craigslist-Pricing-Project/blob/master/craigslist/spiders/CraigSpyder.py

He explains the methodology here:

http://www.racketracer.com/2015/01/29/practical-scraping-using-scrapy/

Python code:
# -*- coding: utf-8 -*-
import scrapy
from craig.items import CraigItem

class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["https://cleveland.craigslist.org/"]
    start_urls = ['https://cleveland.craigslist.org/search/hhh?query=no+section+8&postedToday=1&availabilityMode=0']

    def parse(self, response):
        titles = response.xpath("//p")
        for titles in titles:
            item = CraigItem()
            item['title'] = titles.xpath("a/text()").extract()
            item['link'] = titles.xpath("a/@href").extract()
            items.append(item)
            follow = "https://cleveland.craigslist.org" + item['link']
            request = scrapy.Request(follow , callback=self.parse_item_page)
            request.meta = item
            yield request
        
    def parse_item_page(self, response):
        maplocation = response.xpath("//div[contains(@id,'map')]")
        latitude = maplocation.xpath('@data-latitude').extract()
        longitude = maplocation.xpath('@data-longitude').extract()
        if latitude:
            item['latitude'] = float(latitude)
        if longitude:
            item['longitude'] = float(longitude)
        return item
On line 18 (the "follow =" line), I get: TypeError: cannot concatenate 'str' and 'list' objects.

I run this command to execute the program: scrapy crawl basic -o items.csv -t csv. If I remove the second method, I can get a spreadsheet with titles and links, but I need the geotag.

Any ideas?


FingersMaloy
Dec 23, 2004

Fuck! That's Delicious.
I changed the first "titles" to "title" and edited the loop, but it's throwing the same error. :(
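
Renaming the loop variable can't clear this error on its own: xpath(...).extract() always returns a list of strings, so item['link'] is still a list when it gets concatenated onto the URL. A sketch of one way to rework both methods, assuming CraigItem declares latitude/longitude fields: extract_first() returns a single string (or None), response.urljoin builds the absolute link, and request.meta carries the partly-built item to the second callback:

Python code:
    def parse(self, response):
        for p in response.xpath("//p"):
            item = CraigItem()
            # extract_first() gives one string (or None), not a list
            item['title'] = p.xpath("a/text()").extract_first()
            item['link'] = p.xpath("a/@href").extract_first()
            if not item['link']:
                continue
            request = scrapy.Request(response.urljoin(item['link']),
                                     callback=self.parse_item_page)
            request.meta['item'] = item  # pass the item along
            yield request

    def parse_item_page(self, response):
        item = response.meta['item']  # the partly-filled item from parse()
        maplocation = response.xpath("//div[contains(@id,'map')]")
        latitude = maplocation.xpath('@data-latitude').extract_first()
        longitude = maplocation.xpath('@data-longitude').extract_first()
        if latitude:
            item['latitude'] = float(latitude)
        if longitude:
            item['longitude'] = float(longitude)
        yield item

One more thing worth checking: allowed_domains should be the bare domain ("cleveland.craigslist.org", no scheme or trailing slash), or the offsite middleware may filter the follow-up requests.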
