FingersMaloy
Dec 23, 2004

Fuck! That's Delicious.
Hi goons. I am trying to teach myself to build a web scraper for Craigslist using Scrapy and BeautifulSoup. I've never programmed before, but I've worked through Python Crash Course, and now I'm using Web Scraping With Python.

I have this program that pulls exactly what I want into a CSV if I manually feed it URLs.
Python code:
# Python 2 (urllib2); in Python 3 this lives in urllib.request
from urllib2 import urlopen
from bs4 import BeautifulSoup
import csv

html = urlopen("I type the urls here")
soup = BeautifulSoup(html.read(), "html.parser")

# Pull the fields I want out of the ad page
ad_title = soup.title
for date_and_time in soup.findAll(class_="timeago"):
    date_posted = date_and_time.get("datetime")
body = soup.find(id="postingbody")
mapaddress = soup.find(class_="mapaddress")
for apt in soup.findAll(id="map"):
    lat = apt.get("data-latitude")
    lon = apt.get("data-longitude")

# Append one row to the spreadsheet; note that "html" here is the
# response object returned by urlopen, not the URL string
csvFile = open("test.csv", 'a')
try:
    writer = csv.writer(csvFile)
    writer.writerow((html, ad_title, date_posted, body, mapaddress, lat, lon))
finally:
    csvFile.close()
This works.

The problem I'm having is that I can't get my own crawler to work. I can get Scrapy's tutorial crawler to run, so I know everything is installed correctly. What I need, or imagine I need, is a crawler that populates that "html" variable (or reconfigures the "soup" variable) so the rest of my code can finish the job. Again, I'm working from Scrapy's site:
Python code:
import scrapy
from bs4 import BeautifulSoup
import csv

class CraigslistSpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["cleveland.craigslist.org"]
    start_urls = ["two really long Craigslist search queries"]

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
Can anyone help me connect these two things?

FingersMaloy fucked around with this message at 21:05 on Mar 23, 2017


FingersMaloy
Dec 23, 2004

Fuck! That's Delicious.
quote not edit

FingersMaloy
Dec 23, 2004

Fuck! That's Delicious.
Yeah so I imagined this would work:

Python code:
import scrapy
from bs4 import BeautifulSoup
import csv

class CraigslistSpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["cleveland.craigslist.org"]
    start_urls = ["https://cleveland.craigslist.org/search/apa"]

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        ad_title = soup.title
        for date_and_time in soup.findAll(class_="timeago"):
            date_posted = date_and_time.get("datetime")
        body = soup.find(id="postingbody")
        mapaddress = soup.find(class_="mapaddress")
        for apt in soup.findAll(id="map"):
            lat = apt.get("data-latitude")
            lon = apt.get("data-longitude")
        csvFile = open("test.csv", 'a')
        try:
            writer = csv.writer(csvFile)
            writer.writerow((ad_title, date_posted, body, mapaddress, lat, lon))
        finally:
            csvFile.close()
It's throwing up a syntax error on line 24 ("finally:") right now that I can't resolve, but is it right to have the CSV pieces in the parse method?

Also, I know I'm going to need to write some exceptions into that method later, once I know this will work.

It ran without error, but didn't write to the CSV file! Getting close.
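
A minimal sketch of the more idiomatic route, assuming the same fields as above: instead of opening the file inside parse, yield a plain dict and let Scrapy's feed export write the CSV when the spider is run with "scrapy crawl craig -o test.csv":

Python code:
import scrapy
from bs4 import BeautifulSoup

class CraigslistSpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["cleveland.craigslist.org"]
    start_urls = ["https://cleveland.craigslist.org/search/apa"]

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        # Default the fields so a page missing an element can't raise NameError
        date_posted = lat = lon = None
        for date_and_time in soup.findAll(class_="timeago"):
            date_posted = date_and_time.get("datetime")
        for apt in soup.findAll(id="map"):
            lat = apt.get("data-latitude")
            lon = apt.get("data-longitude")
        # Feed export turns each yielded dict into one CSV row
        yield {
            "title": soup.title.string if soup.title else None,
            "date_posted": date_posted,
            "lat": lat,
            "lon": lon,
        }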

FingersMaloy fucked around with this message at 21:48 on Mar 23, 2017

FingersMaloy
Dec 23, 2004

Fuck! That's Delicious.
Thanks everyone! This is very helpful. I've reworked the CSV part to:

Python code:
with open("test.csv", 'a') as file:
    writer = csv.writer(file)
    writer.writerow((ad_title, date_posted, body, mapaddress, lat, lon))
That's a lot tidier.
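
Two small nits on that, for whatever they're worth: "file" shadows a Python built-in name, and since the first script imports urllib2 (so this is Python 2), the csv docs say the file should be opened in binary mode, otherwise blank rows can sneak in on Windows:

Python code:
# Python 2: append in binary mode ('ab') per the csv module docs
with open("test.csv", 'ab') as f:
    writer = csv.writer(f)
    writer.writerow((ad_title, date_posted, body, mapaddress, lat, lon))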

baka kaba posted:

You've got a few for loops in there where you repeatedly assign the same variables, so they'll end up with whatever the last value was. If you're trying to write multiple records to the CSV file, you need to write one inside the loop each time around (or store them in say a list and then write the whole lot at the end)

I think you mean these two sections:
Python code:
for date_and_time in soup.findAll(class_="timeago"):
    date_posted = date_and_time.get("datetime")

for apt in soup.findAll(id="map"):
    lat = apt.get("data-latitude")
    lon = apt.get("data-longitude")
Without going back and finding the full bit of HTML: I'm trying to pull the "datetime", "data-longitude", and "data-latitude" attributes, but I couldn't figure out how to get them without pulling the whole tag (which has several attributes) and then breaking it out.

I'm going to work on making the whole thing create a list and then write the list to the CSV, if I can figure that out. Thanks again all!
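
A sketch of that list-then-write idea against one of the loops above (this drops into parse, where "soup" already exists; the same pattern works for the map tags):

Python code:
rows = []
for date_and_time in soup.findAll(class_="timeago"):
    # one row per "timeago" element instead of overwriting a single variable
    rows.append([date_and_time.get("datetime")])

with open("test.csv", 'a') as f:
    writer = csv.writer(f)
    writer.writerows(rows)  # writes the whole batch in one call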

FingersMaloy
Dec 23, 2004

Fuck! That's Delicious.
I'm still trying to make this scraper work. I've abandoned BeautifulSoup and totally committed to Scrapy. My spider works, but I can't make it pull the exact pieces I need. I'm using this code as my guide, but it's not fully working:

https://github.com/jayfeng1/Craigslist-Pricing-Project/blob/master/craigslist/spiders/CraigSpyder.py

He explains the methodology here:

http://www.racketracer.com/2015/01/29/practical-scraping-using-scrapy/

Python code:
# -*- coding: utf-8 -*-
import scrapy
from craig.items import CraigItem

class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["https://cleveland.craigslist.org/"]
    start_urls = ['https://cleveland.craigslist.org/search/hhh?query=no+section+8&postedToday=1&availabilityMode=0']

    def parse(self, response):
        titles = response.xpath("//p")
        for titles in titles:
            item = CraigItem()
            item['title'] = titles.xpath("a/text()").extract()
            item['link'] = titles.xpath("a/@href").extract()
            items.append(item)
            follow = "https://cleveland.craigslist.org" + item['link']
            request = scrapy.Request(follow , callback=self.parse_item_page)
            request.meta = item
            yield request
        
    def parse_item_page(self, response):
        maplocation = response.xpath("//div[contains(@id,'map')]")
        latitude = maplocation.xpath('@data-latitude').extract()
        longitude = maplocation.xpath('@data-longitude').extract()
        if latitude:
            item['latitude'] = float(latitude)
        if longitude:
            item['longitude'] = float(longitude)
        return item
On line 18 (the "follow =" line), I get: TypeError: cannot concatenate 'str' and 'list' objects.

I run this command to execute the program: scrapy crawl basic -o items.csv -t csv. If I remove the second method, I can get a spreadsheet with titles and links, but I need the geotag.

Any ideas?


FingersMaloy
Dec 23, 2004

Fuck! That's Delicious.
I changed the first "titles" to "title" and edited the loop, but it's throwing the same error. :(
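
Renaming the loop variable can't clear this error on its own: xpath(...).extract() always returns a list of strings, so item['link'] is still a list when it gets concatenated onto the URL. A sketch of one way to rework both methods, assuming CraigItem declares latitude/longitude fields: extract_first() returns a single string (or None), response.urljoin builds the absolute link, and request.meta carries the partly-built item to the second callback:

Python code:
    def parse(self, response):
        for p in response.xpath("//p"):
            item = CraigItem()
            # extract_first() gives one string (or None), not a list
            item['title'] = p.xpath("a/text()").extract_first()
            item['link'] = p.xpath("a/@href").extract_first()
            if not item['link']:
                continue
            request = scrapy.Request(response.urljoin(item['link']),
                                     callback=self.parse_item_page)
            request.meta['item'] = item  # pass the item along
            yield request

    def parse_item_page(self, response):
        item = response.meta['item']  # the partly-filled item from parse()
        maplocation = response.xpath("//div[contains(@id,'map')]")
        latitude = maplocation.xpath('@data-latitude').extract_first()
        longitude = maplocation.xpath('@data-longitude').extract_first()
        if latitude:
            item['latitude'] = float(latitude)
        if longitude:
            item['longitude'] = float(longitude)
        yield item

One more thing worth checking: allowed_domains should be the bare domain ("cleveland.craigslist.org", no scheme or trailing slash), or the offsite middleware may filter the follow-up requests.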
