  • Locked thread
LochNessMonster
Feb 3, 2005

I need about three fitty


Not sure if this belongs here, but I'm starting to learn programming with Python and am trying to build a web scraper that stores information in some sort of database.

Planning to scrape like 10 different small websites with items and a handful of attributes.

What database would be easy to use with Python? I was thinking something like CouchDB, or SQLite.


LochNessMonster
Feb 3, 2005



Nippashish posted:

SQLite is easiest because it doesn't require a database server, the catch is you can't have multiple concurrent writers to a sqlite database so if you want your scraper to be multithreaded then you should use something else. If you're just starting programming then you are probably not writing a multithreaded scraper anyway, so you should use SQLite. You might also want to look into sqlalchemy, which is an awesome library for working with all kinds of databases (SQLite included), although it's kind of big and complicated and maybe not very easy to set up for a beginner, especially if you've never worked with something similar before.

tl;dr: Use SQLite. SQLite is awesome.
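A minimal sketch of that sqlite3 flow with the standard-library module (the table and column names here are invented, just to show the shape of it):

```python
import sqlite3

# in-memory database for the example; pass a filename like "scraper.db" to persist
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# hypothetical table/column names for scraped items
cursor.execute("CREATE TABLE items (name TEXT, price TEXT)")
cursor.execute("INSERT INTO items VALUES (?, ?)", ("widget", "9.95"))
conn.commit()

rows = cursor.execute("SELECT name, price FROM items").fetchall()
```

No server to install or configure; the connect call creates the database if it doesn't exist yet.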

I don't think I'll make it multithreaded, definitely not straight away. I'm familiar with databases and can write intermediate SQL queries, so I might take a look at sqlalchemy as well.

The last time I did some programming was in college (Java, 15 years ago or something), so besides the general concepts I'm completely new. Especially to Python.

Is it ok to post my code here to let others have a look at it and point out design flaws and/or improvements?

LochNessMonster
Feb 3, 2005



Thanks for the advice. I did look at the project.log but felt kinda silly for starting one as it's my first ever project.

I currently have 1 long python script that scrapes a site, turns it to a beautifulsoup object and parses information I want.

For now I print the output to stdout but I want to start storing it, hence my db question.

The code is starting to get longer so I was contemplating making different scripts: one scraper, one parser and one that writes stuff to the database. I don't want to hammer sites and scrape 10 pages 20 times in a row to test parsing code, for example.

My biggest question is whether I should create methods and/or classes to separate the code. I haven't quite figured out when I should use those.

LochNessMonster
Feb 3, 2005



Dex posted:

also do yourself a favour and try to get in the mindset of test driven development. it might seem like serious overkill for a first project("why would i need a test for something that just returns a url string? waste of time!"), but i find writing tests up front helps me reason about the flow of an application a lot easier than banging out a bunch of code that just does it first and trying to make it sane later.

Any guides/tutorials/guidelines I can look into on that? What I've done so far is just test every minor thing I'm trying to build as a separate standalone app with static and/or test data.

That and print every variable/output/whatever I'm trying to do to confirm it does what I want it to do.
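For a first taste of the test-driven style Dex describes, the standard library's unittest is enough. A sketch, using a hypothetical URL-building helper (the kind of thing that "just returns a url string"):

```python
import unittest

def build_page_url(base, page):
    # hypothetical helper: build the URL for one listing page
    return "{}?page={}".format(base, page)

class TestBuildPageUrl(unittest.TestCase):
    def test_appends_page_number(self):
        self.assertEqual(build_page_url("http://example.com/bikes", 2),
                         "http://example.com/bikes?page=2")

# run the test case programmatically (normally you'd run `python -m unittest`)
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestBuildPageUrl)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Writing the assertion first forces you to decide what the function's contract is before you write it.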

tef posted:

i'd recommend sqlite, but i'd also suggest maybe peewee, depending how much sql you want to pick up

ps, i also like lxml, requests, you might too.

I already found out about requests, it sure makes my life easier. I'll look into lxml and peewee, but my main goal is to learn python, so I'm not sure if peewee will get a fair chance :)


onionradish posted:

You might take a look at vcr.py, which gives an easy way to automatically "record and play back" HTTP traffic so you can run your script/parser as much as you want without hitting the servers while you're testing or doing development on the parser.

You just wrap your web request in a with block, or add a decorator (vcr.py supports urllib2, urllib3, requests and others automatically):
Python code:

with vcr.use_cassette('fixtures/vcr_cassettes/synopsis.yaml'):
    response = urllib2.urlopen('http://www.iana.org/domains/reserved').read()
# -- or --
@vcr.use_cassette('fixtures/vcr_cassettes/synopsis.yaml')
def test_iana():
    response = requests.get('http://www.iana.org/domains/reserved').content

I currently write the files to disk and use those files to run my parsing tests again.

My parser is pretty straight forward at the moment. I get the first page, see how many pages there are and get those too.

I'm not sure how vcr will help me do this more easily (without making it too complex for me). Looking at your code snippets confuses me, as I have no idea how to use them in my current code. I really am a beginner at the bottom level here.

LochNessMonster
Feb 3, 2005



Not sure what's going wrong here.

code:
import os, tarfile
baseFileName = "data"
dataDir = "data/"
dataFiles = os.listdir(dataDir)


#adds all files in dataFiles, including directories and unwanted files.
with tarfile.open("data/archive/test.tar", "w") as tar:
    for name in dataFiles:
        if name.endswith(".data"):
            tar.add(str(name))
gives the following error message.

if I change the last line to:

code:
tar.add(str("something" + name))


it does the trick. But I don't actually want to have files called something<name>. Any clues what I'm doing wrong?

The error message I get is:

code:
Traceback (most recent call last):
  File "/home/dir/project.py", line 11, in <module>
    tar.add(str(name))
  File "/usr/lib/python3.4/tarfile.py", line 1907, in add
    tarinfo = self.gettarinfo(name, arcname)
  File "/usr/lib/python3.4/tarfile.py", line 1779, in gettarinfo
    statres = os.lstat(name)
FileNotFoundError: [Errno 2] No such file or directory: 'htmlpage8.data'
edit:

Of course I figure out what's wrong seconds after posting.

the "something" is actually the subdirectory I'm opening the .data files from (/home/dir/project/data/htmlpage*.data).

Without including the path, it can't find the files. I guess I just need to find out how to tar files without the path data.

LochNessMonster fucked around with this message at 21:17 on Sep 12, 2016

LochNessMonster
Feb 3, 2005



Master_Odin posted:

You could use os.chdir() to set your working path to "something" and then you could just use the name of the file.

Then you'd probably need to use os.getcwd() to get original directory, change into directory with files, add files to tar then move tar to original directory after you're done.

Or just make a copy of each file in original directory, add to tar, delete copy.

I fixed it with:

tar.add(str("data/" + name), arcname=str(name))
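For anyone landing here with the same problem: tar.add() takes the on-disk path, and the arcname argument controls the name stored inside the archive. A self-contained sketch (a throwaway temp directory stands in for the real data/ dir):

```python
import os
import tarfile
import tempfile

# build a throwaway data directory to demonstrate with
workDir = tempfile.mkdtemp()
dataDir = os.path.join(workDir, "data")
os.mkdir(dataDir)
for fname in ("htmlpage1.data", "htmlpage2.data"):
    open(os.path.join(dataDir, fname), "w").close()

tarPath = os.path.join(workDir, "test.tar")
with tarfile.open(tarPath, "w") as tar:
    for name in os.listdir(dataDir):
        if name.endswith(".data"):
            # add by full path, but store only the bare file name
            tar.add(os.path.join(dataDir, name), arcname=name)

with tarfile.open(tarPath) as tar:
    archived = sorted(tar.getnames())
```

The archive members end up as htmlpage1.data etc., with no data/ prefix.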

LochNessMonster
Feb 3, 2005



So my (newbie) webscraping project is actually making some progress. I'm scraping a site with motorcycles and would like to store them in a sqlite3 database but am not sure on how to proceed.

I can get the information I want, which is the brand, type, mileage, year and the dealer who sells it. For each entry I scrape I put the values in variables.

For now I just want to write them to disk or database, but in the end I'd like to identify each one of them uniquely (unfortunately license plates are usually not listed, so I need to think of something for that...).

As for structuring the data I've been doing some reading but I can't really see what I should use to store this info. Do I use a list, tuple or dictionary?

Ideally I'd parse the info for 100-200 items with 5 attributes each. Would it be a good idea to create lists in a list, where the position of each item relates to the value inside the list? Like the first item is brand, 2nd type, 3rd mileage, etc? And then write them to sqlite with list.pop or something? I'm really eager to hear how you approach questions like these because I don't really know how to proceed.

LochNessMonster
Feb 3, 2005



Thermopyle posted:

Use a database. A database has tables. Each table is kind of like a spreadsheet. So you'd have a field (column in spreadsheet) for each attribute.

You use some sort of structure like a dictionary when you're dealing with the items in python, and then map the attributes in your python code to fields in the database.

I'm definitely going to write the data to a database. What I'm struggling with is how to structure the data in my script before writing it to disk.

The program currently scrapes html pages that are saved to disk. Each page has 10 vehicles in it, so I loop through the divs parsing the values for brand/type/mileage/year/dealer.

That's where I am right now.

To get that data to a database I figured I should probably "store" that data inside the script for at least all vehicles on 1 page, or for all vehicles on all the pages.

When I have parsed 1 page (and thus 10 vehicles) I write all of them to the database at once, instead of opening (and closing) a db connection for each iteration of the page loop(s).

So I think I need to put the vehicle info in a list, tuple or dictionary before writing it to the database.

I could be missing something obvious though.

LochNessMonster
Feb 3, 2005



Tigren posted:

That's exactly what dicts and lists are for. So you could have a list called list_of_motorcycles and each motorcycle could be a dict, which links keys to values. Once you have parsed all the information for one motorcycle, append that dictionary to the end of your list_of_motorcycles and move to the next one. Then, at the end, you've got a big ol' list of motorcycle dictionaries.

Python code:

list_of_motorcycles = list() # This establishes an empty list called list_of_motorcycles

# loop through the divs parsing the values for brand/type/milage/year/dealer. 
brand = 'Kawasaki'
type = 'Ninja'
mileage = 10000
year = '2016'
dealer = 'Crazy Ed'
motorcycle = {'brand': brand,
              'type': type,
              'mileage': mileage,
              'year': year,
              'dealer': dealer}
list_of_motorcycles.append(motorcycle)
# end of loop

Thank you, this is what I had in mind as a concept, but I had no clue if it would be a good and/or efficient way of doing it. I also didn't know what type I should've used. Thanks for helping me figure this out!

After I manage to do this, I can create a loop based on Suspicious Dish's sql query to write to a database.

LochNessMonster
Feb 3, 2005



Master_Odin posted:

If all you're doing is taking the data from the form and not doing anything meaningful with it before insertion into the DB, you may as well just insert it into the DB immediately. There's no penalty to leaving the database connection open for the length of the scrape, especially since it's sqlite.

I'm really new to programming so I'm taking baby steps. In the near future I will be doing stuff with the data before inserting it. Good to know I could keep the connection open if I for some reason need to in the future though.

LochNessMonster
Feb 3, 2005



Dex posted:

you could also use sqlalchemy for this.


bit more overhead but it can be easier to maintain if your project starts growing.

It looks a bit more complicated, is it something I should be able to do as a complete beginner?

What are the pros/cons compared to the plain sqlite3 method?

LochNessMonster
Feb 3, 2005



Thermopyle posted:

Use plain sqlite for now.

What you learn will be super useful when you go to start using an ORM like sqlalchemy, as it'll help you understand it.

Was just reading about sqlalchemy and while it looks like overkill for now, I might indeed need it in the future.

I'll start fiddling with sqlite for now.

edit:

I must say I'm starting to enjoy this a lot more than I thought I would. I once had a Java programming class which had a horrible teacher (a consistent 90% of the class failing, several years running) and thought programming sucks balls.

A few years back I was trying to find a specific brand/type motorcycle and I was quite annoyed I had to manually check a lot of dealer sites to see if they had any, so I figured that'd be a nice project to start with.

I now have a scraper/parser working that gets roughly 100 vehicles from 1 dealer who uses a standard web gallery which I've seen several other dealers use as well. Hopefully I can get the data of another 10-20 dealers this way.

I'm pretty sure I'll be running into a lot of design flaws in a while, but figuring those out seems like half the fun.

LochNessMonster fucked around with this message at 19:59 on Sep 16, 2016

LochNessMonster
Feb 3, 2005



I'm probably going to figure this out within 5 minutes after posting, but I've been struggling with inserting data into sqlite3 and I'm not quite sure why.

code:
import sqlite3
conn = sqlite3.connect('vehicles.db')
cursor = conn.cursor()

testDict = {'DealerId': '1',
            'Brand': 'bmw',
            'Milage': '13.909',
            'Dealer': 'crazy jack',
            'Price': '5.000',
            'Type': 'r90',
            'Year': '1980'}

for (DealerId, Dealer, Brand, Type, Price, Milage, Year) in testDict:
    cursor.execute("INSERT INTO motorcycles VALUES(?, ?, ?, ?, ?, ?, ?)",
                   DealerId,
                   Dealer,
                   Brand,
                   Type,
                   Price,
                   Milage,
                   Year)
Running this results in the following error message

code:
Traceback (most recent call last):
  File "/path/to/script/dbinsert.py", line 13, in <module>
    for (DealerId, Dealer, Brand, Type, Price, Milage, Year) in testDict:
ValueError: too many values to unpack (expected 7)
Please point and laugh at my failures :downs:

LochNessMonster
Feb 3, 2005



ArcticZombie posted:

I've never used sqlite3 so I'm winging this a bit, but I don't think your problem is with sqlite3. Looping over a dict doesn't work like that. The for loop will loop over each key in the dict:
Python code:

>>> testDict = {"foo":1,"bar":2,"baz":3}
>>> for key in testDict:
...    print(key)
foo
bar
baz

Your execute should look more like:
Python code:

cursor.execute("INSERT INTO motorcycles VALUES(?, ?, ?, ?, ?, ?, ?)",
               testDict["DealerId"],
               testDict["Dealer"],
               testDict["Brand"],
               testDict["Type"],
               testDict["Price"],
               testDict["Milage"],
               testDict["Year"])

With no need for the loop over it.

The test dictionary only contains 1 entry, but the real resultset will have tens of entries. I just started out with a smaller data set to spot problems early on.

Your example should get me up and running though. I'll try and write something that inserts just the test dictionary and then see how I can write a loop around it so I can repeat it for multiple entries.

LochNessMonster
Feb 3, 2005



ArcticZombie posted:

The real results will be a list of dicts, no? You'll loop over the list, not over the dicts.

yeah, you're right. It's a list containing dictionaries.

I'm really new to this, and still struggling a bit with the different types of data structures (and how to process them).

LochNessMonster
Feb 3, 2005



Tigren posted:

It looks like other people squared you away on your for loop. I wanted to show you this option for inserting from a dict as well. Instead of the question mark "parameterization", you can use named placeholders. Those names are the keys of your dict.

[..]

There's also the option of using cursor.executemany, which allows you to loop over a list of entries like so:


Both options seem like a pretty efficient way of doing things. I didn't get the for loop working yet so I'll give this a try as well.

Just for my understanding: the In/Out [number] is from your Python environment?

LochNessMonster
Feb 3, 2005



When using the for loops I keep getting the same error, no matter what I try.

When running this:
Python code:
import sqlite3
conn = sqlite3.connect('vehicles.db')
cursor = conn.cursor()

testList = [{'DealerId': '2',
            'Brand': 'bmw',
            'Milage': '13.909',
            'Dealer': 'crazy jack',
            'Price': '5.000',
            'Type': 'r90',
            'Year': '1980'},
            {'DealerId': '3',
            'Brand': 'goldwing',
            'Milage': '15.564',
            'Dealer': 'Mohawk Mike',
            'Price': '8.000',
            'Type': 'goldwing',
            'Year': '2015'}]


for dictionary in testList:
    cursor.execute("INSERT INTO motorcycles VALUES(?, ?, ?, ?, ?, ?, ?)",
                   dictionary["DealerId"],
                   dictionary["Dealer"],
                   dictionary["Brand"],
                   dictionary["Type"],
                   dictionary["Price"],
                   dictionary["Milage"],
                   dictionary["Year"])
I keep getting:

code:
Traceback (most recent call last):
  File "/path/to/program/dbinsert2.py", line 29, in <module>
    dictionary["Year"])
TypeError: function takes at most 2 arguments (8 given)
I'm trying to figure out why it only takes 2 arguments and how it's counting 8 that are being given. I really only count 7 question marks and 7 dictionary values.
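(For what it's worth: cursor.execute takes exactly two arguments, the SQL string and a single sequence of parameters, so the eighth "argument" the error is counting is the SQL string itself. Wrapping the seven values in one tuple fixes it. A runnable sketch against a throwaway in-memory table, with the table layout guessed from the INSERT statements in the thread:)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
# column names guessed from the thread; SQLite allows untyped columns
cursor.execute("CREATE TABLE motorcycles "
               "(DealerId, Dealer, Brand, Type, Price, Milage, Year)")

testList = [{'DealerId': '2', 'Brand': 'bmw', 'Milage': '13.909',
             'Dealer': 'crazy jack', 'Price': '5.000',
             'Type': 'r90', 'Year': '1980'}]

for d in testList:
    # all seven values go in as ONE tuple: execute(sql, parameters)
    cursor.execute("INSERT INTO motorcycles VALUES (?, ?, ?, ?, ?, ?, ?)",
                   (d["DealerId"], d["Dealer"], d["Brand"], d["Type"],
                    d["Price"], d["Milage"], d["Year"]))
conn.commit()
rows = cursor.execute("SELECT Dealer, Brand FROM motorcycles").fetchall()
```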


So I went and tried Tigren's suggestion on how to tackle this. It doesn't give any errors, but for some reason it doesn't really enter the values either. When I do a SELECT * FROM motorcycles; it only returns 1 bogus value I added manually.

Python code:
import sqlite3
conn = sqlite3.connect('vehicles.db')
cursor = conn.cursor()

testList = [{'DealerId': '2',
            'Brand': 'bmw',
            'Milage': '13.909',
            'Dealer': 'crazy jack',
            'Price': '5.000',
            'Type': 'r90',
            'Year': '1980'},
            {'DealerId': '3',
            'Brand': 'goldwing',
            'Milage': '15.564',
            'Dealer': 'Mohawk Mike',
            'Price': '8.000',
            'Type': 'goldwing',
            'Year': '2015'}]

cursor.executemany("INSERT INTO motorcycles VALUES (:Brand, :Dealer, :DealerId, :Milage, :Price, :Type, :Year)", testList)



Time to figure out how to do error handling I guess.

LochNessMonster fucked around with this message at 10:19 on Sep 22, 2016

LochNessMonster
Feb 3, 2005



Hammerite posted:

Do you need to commit the transaction?

I feel pretty stupid now...

LochNessMonster
Feb 3, 2005



Thanks for all the feedback. I managed to get it working with Tigren's way. The code is now looking like this:

Python code:
<scraping part>
motorcycle = {'Brand': getBrand.get_text(),
              'Type': getType.get_text(),
              'Price': getPrice.get_text(),
              'Milage': getMilage.get_text(),
              'Dealer': dealerName,
              'Year': getYear.get_text(),
              'DealerId': getDealerId['id']}
processingList.append(motorcycle)

dbPath = str(dataDir + dbName)

try:
    conn = sqlite3.connect(dbPath)
except Exception as err:
    print('Could not connect to vehicles.db because reasons: ' + str(err))
cursor = conn.cursor()

try:
    cursor.executemany("INSERT INTO motorcycles VALUES (:DealerId, :Dealer, :Brand, :Type, :Price, :Milage, :Year)", processingList)
    conn.commit()

except Exception as err:
    print('An exception happened: ' + str(err))

conn.close()
It's really cool to see results while learning a useful new skill. I'm really enjoying this, and already have lots of other things in mind to incorporate in it.

For now it's still a lot of trial and error, and thinking about what I'm doing wrong but when I started a few weeks back I knew nothing at all (compared to you guys I guess that's still the case) but now I seem to find my way a bit.

LochNessMonster fucked around with this message at 17:26 on Sep 22, 2016

LochNessMonster
Feb 3, 2005



Eela6 posted:

LochNess, glad to see you're enjoying python. It is by far my favorite language.

As a small piece of advice, try and avoid the construct
code:

except Exception:
#code
Unless you're planning to re-raise the exception. Catching and handling specific errors is OK, but generally ignoring ALL errors is asking for trouble later on. This is the sort of thing that seems expedient at the time but can make debugging your program down the line basically impossible.

Good luck with your coding!

Thanks for the feedback, could you explain what is actually wrong and how I could improve on that? I thought I'd catch the errors for that specific code and print it to stdout. That way I knew what went wrong.

LochNessMonster
Feb 3, 2005



Reading these explanations and those links clears things up a bit (or a lot, actually).

The reason I initially did it this way was a) because I didn't know any better, and b) I had no idea which other (specific) Errors I was looking for.

The reason I put it in there was because my program seemed to get all the input, put it in a dictionary and add each dictionary to the list, but it didn't write the data to the database. This actually happened for several reasons, some of which I found out with except Exception. The reasons were: the database was in a different path than my program was looking in, the table had a typo so it couldn't insert, and finally I didn't commit the changes.

What I didn't realize is that I could possibly catch lots of other Exceptions. So I'm going to rewrite this with trying to catch a sqlite3.DatabaseError for the connection, and a sqlite3.IntegrityError to see if my insert is working properly (if it isn't, I don't want to pollute the database).
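A sketch of what that can look like: catch only the specific sqlite3 error and let everything else propagate (here a deliberate duplicate primary key triggers the IntegrityError):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE motorcycles (DealerId TEXT PRIMARY KEY)")

caught = None
try:
    cursor.execute("INSERT INTO motorcycles VALUES (?)", ("1",))
    cursor.execute("INSERT INTO motorcycles VALUES (?)", ("1",))  # duplicate key
    conn.commit()
except sqlite3.IntegrityError as err:
    # handle the one failure we expected might happen; any other error still crashes
    caught = err
    conn.rollback()
    # use a bare `raise` here instead if the program can't sensibly continue
```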

LochNessMonster
Feb 3, 2005



SurgicalOntologist posted:

I'll just add that if you're only printing out the error message, you're not getting any more information than if you had just let it hit. Eela6's suggestion is a good reason to catch exceptions while debugging but for it to provide benefits you need to print out some other useful information after catching. E.g. the value of some variables that may be related to the error. Finally, as in Eela6's example, you should still re-raise the exception if you're only catching it for debugging purposes (that is, if you don't do some error-handling that will allow your program to continue regardless).

I'll also add that I usually re-raise an exception with a bare "raise" rather than "raise err". I think this provides a better traceback, but I was not able to confirm this with a quick Google.

I still have to wrap my head around the raising and re-raising you guys keep mentioning. I did print out a lot of other variables (and sometimes type(var) for the bs4 variables) to get an idea what was going wrong.
I cleaned that part up before posting though.

LochNessMonster
Feb 3, 2005



My web scraper project for scraping motorcycle dealer sites is coming along nicely. I've noticed that a large majority of dealers use about 4 different inventory modules. I'm currently focussing on one of them, and the scraping part works well. The parsing of the data is also coming along, but it appears that almost every dealer has some small modification in the way the divs or spans are named. For example the vehicle price can be found in <div class=price>, <div class=vehiclePrice>, <div class=vehiclePrices> or even <span class=vehiclePrice>. So plenty of variations, and this is not just the case for Price, but happens for Mileage and Year as well.

What started as "oh well, I'll just create one parser per dealer" is now becoming a bit annoying, since 99% of the code for each dealer appears to be the same, but I have a hard time figuring out how to move away from 1 parser per dealer and turn this into a single parsing application.

I was hoping anyone has an idea on how to solve this. There are so many variations floating around that I'm keeping a spreadsheet tracking which divs/spans each dealer keeps their attributes in. I'm sure there's a smarter solution for that, but I can't really come up with something.

Python code:
#!/usr/bin/python3
import os
import bs4

dealerName = "Jack"
baseFileName = dealerName + "-html"
dataDir = "data/"
dbName = 'vehicles.db'
countVehicles = 0
countPages = 0
processingList = list()

#need to find out how many files there are. 
dataFiles = os.listdir(dataDir)

#loop through all pages to parse.
for file in dataFiles:
    if file.startswith(dealerName):
        openData = open(str(dataDir + file))
        dataSoup = bs4.BeautifulSoup(openData, 'html5lib')
        refineData = dataSoup.find_all("div", {"class": "vehicle"})
   
        print("Processing Page: " + file)
        countPages += 1
        
        #get tags & data 
        for vehicle in refineData:
            countVehicles += 1
            getBrand = vehicle.find("div", {"class": "brand"})
            getType = vehicle.find("div", {"class": "type"})
            getPriceDiv = vehicle.find("div", {"class": "vehicleSpecs"})
            getPrice = getPriceDiv.find("span", {"class": "price"})
            getMilage = vehicle.find("span", {"class": "vehicleTeller"})
            getYearText = vehicle.find("span", {"class": "vehicleYear"})
            getYear = getYearText.next_sibling.next_sibling
            getDealerId = vehicle.find(True,{'id':True})
        
            #put data in dictionary 
            motorcycle = { 'Brand': getBrand.get_text(),
                           'Type': getType.get_text(),
                           'Price': getPrice.get_text(),
                           'Milage': getMilage.get_text(),
                           'Dealer': dealerName,
                           'Year': getYear.get_text(),
                           'DealerId': getDealerId['id']}

            #Add dictionary to list
            processingList.append(motorcycle)

< after this there's code for the data being written to sqlite3 db>

LochNessMonster
Feb 3, 2005



Extortionist posted:

Why store that information in a spreadsheet instead of in a dictionary in the code or in the db, which the script could then refer to when doing the parsing?

Alternatively, if the variations are simple enough, you can pass regex objects to bs.find() instead of explicit strings.

Unfortunately the divs vary in position as well, and have a multitude of different parent divs, so regexes are out of the question. I could create some sort of dictionary with all different options, and then keep track which options work for each dealer. Really need to think this through because it appears to be a bit complicated. Sometimes I need to get a specific parent div, because the span class I'm looking for actually occurs multiple times in the div that contains all vehicle info.

I'm not even sure if my spreadsheet is 100% correct, so I guess I need to start take inventory first. I quickly checked for Price, and that only has 3 variations by itself.

code:
getPriceDiv = vehicle.find("div", {"class": "vehicleSpecs"})
getPrice = getPriceDiv.find("span", {"class": "price"})

getPrice = vehicle.find("span", {"class": "price"})

getPriceDiv = vehicle.find("div", {"class": "vehiclePrices"})
getPrice = getPriceDiv.find("span", {"class": "price"})
the span class seems to be the same for each one, so I just need to determine if it's unique and if so, which div I need to get first.
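One way to fold those variations together is a small helper that tries a list of candidate CSS selectors and returns the first hit. (The helper name and the HTML snippet below are made up; the selectors are the ones from the post above.)

```python
import bs4

def first_match(element, selectors):
    """Return the first element matched by any selector in the list, or None."""
    for css in selectors:
        found = element.select(css)
        if found:
            return found[0]
    return None

# the price variations seen so far, most specific first
PRICE_SELECTORS = ["div.vehicleSpecs > span.price",
                   "div.vehiclePrices > span.price",
                   "span.price"]

# made-up vehicle div using the second variation
html = ('<div class="vehicle"><div class="vehiclePrices">'
        '<span class="price">5.000</span></div></div>')
vehicle = bs4.BeautifulSoup(html, "html.parser")
price = first_match(vehicle, PRICE_SELECTORS)
```

Ordering the list most-specific-first matters when a page contains more than one span.price.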

baka kaba posted:

I'm not sure you can really get around it, if everyone's doing things slightly differently then the best you can do is be as smart as possible with your selector/parsing code. You could try and generalise and get some definitions that work on every page, but honestly that just makes things complicated when they inevitably change something and create some new cases you have to catch. Maybe try making a separate dict or whatever for each page, just to handle getting the set of elements, and write a common scraper that just uses the appropriate class to get the data it needs to work on

You might actually be able to do it using CSS selectors as strings. Like, I *think* this is right (haven't tested it)
Python code:
cool_bikes = {
    'brand_selector': "div.brand",
    'type_selector': "div.type",
    'price_selector': "div.vehicleSpecs > span.price" # combined two steps there, y'see
}

current_page = cool_bikes
type = vehicle.select(current_page.type_selector)
and so on. Or use a class, something like that, so you have a common interface for getting the selectors a specific page uses

You might want to use methods instead though, in case a page has something that needs more logic than a selector can provide (like having to find a certain sibling)

I'm pretty new to python (this is my first project; I'm slowly but steadily expanding my knowledge) but I haven't looked at methods or classes yet.
Could you explain what your code snippet does exactly, I'm a bit at a loss here...

edit: to clarify, I see what you're doing, but I don't really understand how to apply that to my code.

LochNessMonster fucked around with this message at 18:05 on Oct 6, 2016

LochNessMonster
Feb 3, 2005



subway masturbator posted:

A department of my job also works with parsing data scraped from disparate websites and they learned that even if the sites are similar or share some parsing code, you are better off treating each site as a special snowflake.

Granted, if in your case the sites are similar enough, you might be able to get away with creating a framework of sorts for parsing, and abstracting away the specific parts for each site, and implementing it on specific modules/packages for each site.

The dealers I'm currently looking at all use the same sort of off-the-shelf inventory software. They just configured it differently apparently (or are working with slightly different versions).

I just looked at all the scripts I have, and so far I have 20 dealers and in total 7 different variations. It's not as bad as I thought. There are 4 variations on Price, 2 on Mileage and 3 on Year.

Would it be an idea to expand my dealerDict by adding flags for these variations, and then create variables with the values for each price, mileage and year flag?

Python code:
dealerDict = {"dealer1": [siteurl, price4, milage2, year1],
              "dealer2": [siteurl, price2, milage1, year2]}
It kinda looks like a dirty solution, but I guess it could work. Or maybe I should just store the flags in the dealer table of my database and read it from there.
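A variation on that idea that avoids the bare flag names: store the selector strings themselves per dealer, so the parsing loop only ever does a lookup. (Dealer names, URLs and the exact selectors here are placeholders.)

```python
# per-dealer parsing config; keys and values are placeholders for illustration
DEALER_CONFIG = {
    "dealer1": {"siteurl": "http://example.com/dealer1",
                "price": "div.vehiclePrices > span.price",
                "milage": "span.vehicleTeller",
                "year": "span.vehicleYear"},
    "dealer2": {"siteurl": "http://example.com/dealer2",
                "price": "span.price",
                "milage": "span.milage",
                "year": "div.year"},
}

def selectors_for(dealer):
    # unknown dealers raise KeyError instead of silently parsing wrong
    return DEALER_CONFIG[dealer]
```

With only 7 variations across 20 dealers, many entries will repeat, but the repetition buys you one generic parsing loop.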

LochNessMonster
Feb 3, 2005



baka kaba posted:

(sorry, I screwed up and didn't use dict[key] notation for pulling out the selector strings in the dictionary!)

The idea is that instead of having separate chunks of code for each site, basically repeating the same functionality but with slightly different ways of getting the same information, you have a general set of steps that delegates the details to some other object. That way you can repeat the same steps, but with different objects handling those details, each one representing a site and its particular way of getting the data you need. Like a specialist

So the code I posted was basically holding a bunch of CSS selector strings which (should!) correspond to the selectors you posted for each bit of data in your example. If you do vehicle.select("div.brand") for example, it should do the same as vehicle.find("div", {"class": "brand"}). But because it's a single string, you can stick it in a dictionary with a specific key, and just pull it out and put straight into the select call. vehicle.select(site_dict['brand_selector'])

Thanks for explaining, this makes a lot more sense to me now. CSS Selectors seem like a good (better) solution than what I'm currently doing.


quote:

The reason that's useful is you can create that dict for one site, with all the selectors you need for each piece of data. Then you can create another dict for another site, with its own special snowflake selectors, but using the same key names. That way your main code (that does the selecting) can use the exact same calls
Python code:

cool_bikes = { 'brand_selector': 'div.brand' }
rad_bikes = { 'brand_selector': 'a.manufacturer' }

for site in (cool_bikes, rad_bikes):
    # get refine_data...
    for vehicle in refine_data:
        brand = vehicle.select(site['brand_selector'])

if you see what I'm doing there. You get to reuse the code because you're changing the site object, which takes a generic call but provides the specific data for that site. You're basically looking up a common key in each dictionary, and getting a value specific to each site. You can add more into the dictionary too, like a 'filename' key so your loop can load up the correct file for each site. You're basically querying the current site's dictionary to get the specifics each time

I'm not sure if creating a dictionary per site is the way I want to go for now. I have 20 sites so far and have 3 attributes I'd like to get which each have 4, 2 and 5 options. In total there are only 7 variations so far.

What I had in mind is adding a value for each of those 3 attributes (price / mileage / year) so I can refer to a specific function (is that the right name?) for each value.

This part of your example I still don't understand. It might be the naming of dictionaries that confuses me.

The cool_bikes/rad_bikes dictionaries are specific searches. So 1 site can use the first one while another one could use the 2nd one. Right?

What happens next is the part where I get lost. For each site that has those options, you fill brand with the key of the value that the site matches. Am I lost now or is this correct?

If I'm still on track that'd mean I'd have to do a for loop for each attribute I wanted to check right?

I'm sure it's my lack of understanding but it seems a bit counter intuitive to me :)


quote:

I don't actually do much python so I'm trying not to push classes too hard, but those are what I'm used to. A class can be like a fancier dictionary that can hold properties like selector strings, it can provide methods that do more involved processing and return a value (like maybe some more involved logic to find an awkwardly defined element on the page), and you can make sure that different classes have the exact same interface so they'll always have a get_brand(html_page) method (or whatever) that you can call every time in your loop. But if you're not familiar with them you'll probably want to learn about them first - but it's the same idea in the end, code reuse and splitting off the specifics into separate components you can just plug in to your main code

Sorry if none of that makes much sense, I didn't get any sleep :v:

I was thinking about creating methods and/or classes for the different options and track which class I should call for each dealer. That'd make creating additional options less of a hassle in the long run I think.

But since I'm not familiar with either I really need to read up and learn about those before actually doing something with them.
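The class idea being discussed above could look something like this. It's just a sketch: the site classes, selector strings, and the dict standing in for a parsed BeautifulSoup page are all made up, and the real version would call something like `vehicle.select_one(...)` instead of a dict lookup:

```python
# One class per site variation, all exposing the same interface, so the
# main loop never needs to know which site it is handling.

class CoolBikesSite:
    brand_selector = "div.brand"

    def get_brand(self, vehicle):
        # with BeautifulSoup this might be:
        #   return vehicle.select_one(self.brand_selector).get_text()
        return vehicle[self.brand_selector]

class RadBikesSite:
    brand_selector = "a.manufacturer"

    def get_brand(self, vehicle):
        return vehicle[self.brand_selector]

# fake "parsed vehicles" standing in for BeautifulSoup results
cool_vehicle = {"div.brand": "Honda"}
rad_vehicle = {"a.manufacturer": "Yamaha"}

for site, vehicle in [(CoolBikesSite(), cool_vehicle),
                      (RadBikesSite(), rad_vehicle)]:
    brand = site.get_brand(vehicle)  # same call, site-specific behaviour
```

The point is the loop body stays identical no matter which site object is plugged in.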


baka kaba posted:

Well I don't completely get the deal with the combinations of attributes or whatever, so it's hard to visualise exactly what you're pulling out and how it can vary. cool_bikes is just a bunch of info specific to the Cool Bikes site, with the specific selector queries you need for each bit of data you're scraping. So while you're scraping that site, you refer to that dictionary for all the info you need.

It's just a way of standardising it, like creating a form and filling out the specific lookup info for each site, then referring to that as you pull out the stuff. You'd have one for each site, probably even if two are identical, just to keep things simple and organised and independent

I can imagine it's difficult to understand as you can't see the html source I'm trying to parse. I (now) understand the general idea you were showing me and I think I found a way to make that work, so thanks for that!


Another (likely dumb) question about something I don't quite understand.

I figured I'd take the following route with parsing different variations of a specific webstore template. Each dealer has a specific variation of div/span classes in which the price, mileage and year info is stored. Since there are only a few options per value I'd like to scrape, I figured I'd write a loop that checks which dealer uses which variation. I store which dealer site uses which variation in a dictionary with the dealer name as key and the values in a tuple.

Python code:
dealerInfo = {"alpha": "(2, 1, 2)",
              "bravo": "(3, 2, 3)",
              "charlie": "(2, 1, 2)",
              "delta": "(2, 1, 1)",
              }
Later on, the flags in this dict determine which BeautifulSoup search should be done.

Python code:
for dealerName, flags in dealerInfo.items():

<some code>

#getPrice dependent on flags[1]
                if flags[1] == "1":
                    getPrice = vehicle.find("span", {"class": "price"})
                elif flags[1] == "2":
                    getPriceDiv = vehicle.find("div", {"class": "vehicleSpecs"})
                    getPrice = getPriceDiv.find("span", {"class": "price"})
                elif flags[1] == "3":
                    getPriceDiv = vehicle.find("div", {"class": "vehiclePrices"})
                    getPrice = getPriceDiv.find("span", {"class": "price"})
                elif flags[1] == "4":
                    getPriceDiv = vehicle.find("div", {"class": "vehiclePrice"})
                    getPrice = getPriceDiv.find("span", {"class": "price"})
This works perfectly fine, but I figured the values in the dictionary would be flags[0], flags[1] and flags[2]. That however does not return the values I'd like. It returns: "(", "2" and "," respectively. So it seems the value of the dictionary is being treated as a string instead of a tuple. Which works, as long as I take the correct position in the string, but it's not what I had in mind.

fake edit: while I'm typing this I'm realizing my error. If I don't wrap the tuple in quotes, and remove the quotes from the if flags[x] == "value" it treats the tuple as a tuple instead of a string...

:ughh:
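For anyone skimming past: the string-vs-tuple mix-up above fits in four lines. Indexing the quoted version walks over characters, indexing the real tuple gives the stored values:

```python
# The same flags, once as a string and once as an actual tuple
flags_as_string = "(2, 1, 2)"
flags_as_tuple = (2, 1, 2)

first = flags_as_string[1]   # '2' -- indexing a string gives single characters
second = flags_as_tuple[1]   # 1   -- indexing a tuple gives the stored values
```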


So now I've built my scraper/parser and put the data in a sqlite3 db, but now comes the presentation.

What's the easiest way to make a website that queries the db to show the data? Can I do that with Python too, or should I look into JavaScript for that?


Thermopyle posted:

Ahh, well if I knew you wanted to make a website from the data I would have started you on Django from the start.

Since you've already got it in sqlite with your own schema, I'd probably use Flask.

No Javascript required.

When I started I didn't have a real idea what to do; I'm just trying to learn and think about new things to do when I've finished something.

I was reading about Django and Flask a bit earlier and I was leaning towards Django, as it appears to be a larger framework which might suit additional features I think of later. One of the things I'd like to do is price tracking, and visualizing that with a graphing framework or something.

Flask seems to be in really early development (version 0.10). Does that matter?


Django sounds awesome, but it might be overkill for a newbie like me.

Will I be able to add other functionality (like creating graphs) to a Flask web application later on?


Ok great to hear. Flask it is!

I noticed BigRedDot post some stuff about Bokeh in this thread, and that actually gave me the idea. It looks awesome and I can't wait to try it out, but first Flask.


BigRedDot posted:

Let me know if I can answer any questions

I will, thanks!


Dr Monkeysee posted:

If you really wanted to expand your tech exposure you could expose your db as an HTTP web service (in which case I'd recommend Flask and the plugin Flask-RESTful) and build the website client-side with React or Angular or something. I wouldn't necessarily recommend this over a server-side web site but it's a different approach that is widely used in industry today.

I was considering this when looking into the differences between django and flask. And making each functionality a microservice (sorry for buzzwords) actually sounds like a great idea since I have no clue what I'll be doing next. It's just a hobby project I started and keep expanding. Making components small and (relatively) easy to change sounds like a good idea. That said, I'll first try and get a default flask application running. After that I'm gonna experiment with a RESTful api. When I'm ready I think I'll need to head over to the front end thread and start a religious war by asking whether to use React or Angular.


I'm trying to generate a page for each dealer in my sqlite db, which kinda works, but when trying to display all vehicles for that dealer I'm running into some issues.

I think my 2nd sql query isn't picking up the dealer name. And even if it were, I'm not sure it'd actually return all vehicles for each page. I'm kinda stuck on how to figure this out.

Python code:
@app.route('/dealers/<dealerName>/')
def show_dealer(dealerName):
    dealerName = query_db('select distinct dealer from motorcycles')
    for dealer in dealerName:
        dealerVehicles = query_db("select * from motorcycles where dealer = %s") % (dealer)
    return render_template('layout.html', dealerName=dealerName, dealerVehicles=dealerVehicles)


Tigren posted:

Your string substitution is out of place.


dealerVehicles = query_db("select * from motorcycles where dealer = %s" % (dealer))


If you're using python3, take a look at .format() instead. https://pyformat.info/

I'll read up on that, thanks. I have done several tutorials and didn't come across string substitution yet, so I've been googling that part. I'm indeed using python3 so that's probably my first issue.

I noticed you mentioned my for loop overwriting the variable with each iteration but changed your reply so it's gone. That was a relevant remark as well though, I'm kinda ashamed I didn't notice that myself.


Symbolic Butt posted:

Just a heads up for good programming practices, don't do string interpolation on sql queries:
Use the database library interface instead:

Python code:
dealerVehicles = cursor.execute("select * from motorcycles where dealer = ?", [dealer])
Read up on sql injection and prepared statements to know more about why this is important for eventual security and performance concerns.

My query_db function is based on the following functions, so I am using cursor.execute, just calling it via another function. For some reason I just can't get it to work.

Python code:
#db functions
def connect_db():
    rv = sqlite3.connect(app.config['DATABASE'])
    rv.row_factory = sqlite3.Row
    return rv

def get_db():
    if not hasattr(g, 'sqlite_db'):
        g.sqlite_db = connect_db()
    return g.sqlite_db

def query_db(query, args=(), one=False):
    cur = get_db().execute(query, args)
    rv = cur.fetchall()
    cur.close()
    return (rv[0] if rv else None) if one else rv

#url functions
@app.route('/dealers/<dealerName>/')
def show_dealer(dealerName):
    dealerName = query_db('select distinct dealer from motorcycles')
    dealerVehicles = dict()
    for dealer in dealerName:
        vehicles = query_db("select * from motorcycles where dealer = ?", [dealer])
        dealerVehicles[dealer] = vehicles
    return render_template('dealer.html', dealerName=dealerName, dealerVehicles=dealerVehicles)
Error message I get from Flask is:

code:
File "/path/to/project/scraper-frontend.py", line 36, in query_db
    if not hasattr(g, 'sqlite_db'):
        g.sqlite_db = connect_db()
    return g.sqlite_db
 
def query_db(query, args=(), one=False):
    cur = get_db().execute(query, args)
    rv = cur.fetchall()
    cur.close()
    return (rv[0] if rv else None) if one else rv
 
#Test functions
sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type.
I've been trying to google the error message, but only found anwsers that say the query should be in the format you already suggested.


edit: maybe I should clarify what I'm trying to achieve a bit better. I currently have 40ish dealers. For each dealer I'd like to have a url ../dealers/<dealerName> which displays all vehicles for this dealer.

I still have a feeling I'm doing this completely wrong, but can't figure out what.

edit2: I managed to get it working with (dealer) instead of [dealer]

Python code:
vehicles = query_db("select * from motorcycles where dealer = ?", (dealer))
It now returns sqlite3.Row objects, so I just need to make sure to pick the columns I'd like to display.

And it displays all vehicles/dealers on each page, but that's something I need to fix in the jinja template.
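A self-contained demo of the parameter binding above, using an in-memory database with invented rows (the table schema and data are made up; `query_db` from the earlier post would wrap the same `execute` call). Note the trailing comma: `(dealer_name,)` is a one-element tuple, while `(dealer_name)` is just the string itself:

```python
import sqlite3

# In-memory stand-in for the scraper's database
db = sqlite3.connect(":memory:")
db.row_factory = sqlite3.Row
db.execute("create table motorcycles (dealer text, brand text)")
db.executemany("insert into motorcycles values (?, ?)",
               [("alpha", "Honda"), ("alpha", "Yamaha"), ("bravo", "BMW")])

dealer_name = "alpha"
# parameterized query: the ? placeholder is bound from a one-element tuple
rows = db.execute("select * from motorcycles where dealer = ?",
                  (dealer_name,)).fetchall()
brands = [row["brand"] for row in rows]  # sqlite3.Row supports name lookup
```

In the Flask route, the `dealerName` captured from the URL could be bound the same way, so each `/dealers/<dealerName>/` page only queries that one dealer.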

LochNessMonster fucked around with this message at 10:51 on Nov 3, 2016


HardDiskD posted:

I'm also quite new to Docker but in my experience, yeah, you really don't exactly need to ensure that the container's Python environment is clear, after all that's what the Docker container is for.


I'd love to hear from someone that has more experience with Docker, because I still can't wrap my head around on how to do some things, most importantly on how to update code and few other stuff.

I've only used venv once so I probably shouldn't say anything about it. But isn't the whole point of venv to have a virtualized environment for 1 specific version of Python and a specific set of packages?

That's what you would pack into 1 container. If you have the need to run different applications with different python/package versions you'd normally do that in another container.

I'm really new to programming in general so I might be completely off here. The idea behind Docker is to create microservices and put each functionality/program/service in its own container, and have the containers talk to each other.


Feral Integral posted:

Can somebody tell me what I'm doing wrong with this simple re?

code:

asdf = re.findall(r'^.*\[(.*)\].*$', 'asdfas[asdhfjksdaf]asdfas[sadcgfdsfg]asdfasdf')

I keep getting only the first match ['sadcgfdsfg'], rather than what I expect: ['sadcgfdsfg', 'sadcgfdsfg']

I assume you would like to get:
['asdhfjksdaf', 'sadcgfdsfg']

What you just typed was the 2nd hit twice, so I assume that's a copy/paste error.

Did you try the string without the ^ and/or $? Your regex currently looks for a string that:

- starts with any characters
- contains / captures all characters between square brackets
- ends with any characters.

Unfortunately I'm phone posting so can't check it myself.

What you could try is something like this instead (non-greedy, without the anchors):
code:

\[(.*?)\]


Edit: beaten like a red headed stepchild.
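To spell out the fix being discussed: the anchored greedy pattern can only produce one match for the whole string, so `findall` needs a pattern that matches each bracketed chunk on its own, with a non-greedy capture so it stops at the first closing bracket:

```python
import re

# Non-greedy capture, no anchors, no surrounding .* -- one match per
# bracketed group instead of one match for the whole line.
asdf = re.findall(r'\[(.*?)\]',
                  'asdfas[asdhfjksdaf]asdfas[sadcgfdsfg]asdfasdf')
# asdf == ['asdhfjksdaf', 'sadcgfdsfg']
```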



I'm still playing with requests and bs4 and am running into an issue I'm not sure how to solve.

The html page contains something like this:

HTML code:
<div class="out-of-stock" qtlid="12345">
                                        Currently not in stock
                                        <br><span style="display: none;" qtlid="67890">
                                            Last in stock: {value}.
                                        </span>
</div>
When I visit the page in a browser I see a date/time where the html code says {value}. I assume this is something that gets filled in by JavaScript. Looking at the source, I probably know (the path to) the script which does this.

Is there an easy way to let my python app run the specific JavaScript so I can scrape the date/time?
