|
Not sure if this belongs here, but I'm starting to learn programming with Python and am trying to build a webscraper that stores information in some sort of database. Planning to scrape around 10 different small websites with items and a handful of attributes each. What database would be easy to use with Python? I was thinking something like CouchDB, or SQLite.
|
# ¿ Sep 3, 2016 09:10 |
|
Nippashish posted:SQLite is easiest because it doesn't require a database server; the catch is you can't have multiple concurrent writers to a SQLite database, so if you want your scraper to be multithreaded then you should use something else. If you're just starting programming then you are probably not writing a multithreaded scraper anyway, so you should use SQLite. You might also want to look into sqlalchemy, which is an awesome library for working with all kinds of databases (SQLite included), although it's kind of big and complicated and maybe not very easy to set up for a beginner, especially if you've never worked with something similar before.

I don't think I'll make it multithreaded, definitely not straight away. I'm familiar with databases and can write intermediate SQL queries, so I might take a look at sqlalchemy as well. The last time I did any programming was in college (Java, 15 years ago or something), so besides the general concepts I'm completely new, especially to Python. Is it OK to post my code here to let others have a look at it and point out design flaws and/or improvements?
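To give a sense of how little setup SQLite needs, here's a minimal sketch (the table and column names are made up; a real scraper would pass a filename instead of `:memory:`):

```python
import sqlite3

# An in-memory database for illustration; pass a filename like
# "bikes.db" instead of ":memory:" to persist data between runs.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE items (name TEXT, price INTEGER)")
cur.execute("INSERT INTO items VALUES (?, ?)", ("example", 100))
conn.commit()  # writes only hit the database once you commit
rows = cur.execute("SELECT name, price FROM items").fetchall()
conn.close()
```

No server process, no configuration: the whole database is one file (or in-memory), which is why it suits a single-threaded scraper so well.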
|
# ¿ Sep 3, 2016 10:49 |
|
Thanks for the advice. I did look at the project.log but felt kinda silly for starting one as it's my first ever project. I currently have one long Python script that scrapes a site, turns it into a BeautifulSoup object and parses the information I want. For now I print the output to stdout, but I want to start storing it, hence my db question. The code is starting to get longer so I was contemplating splitting it into different scripts: one scraper, one parser and one that writes stuff to the database. I don't want to hammer sites and scrape 10 pages 20 times in a row just to test parsing code, for example. My biggest question is whether I should create methods and/or classes to separate code. I haven't quite figured out when I should use those.
|
# ¿ Sep 3, 2016 12:41 |
|
Dex posted:also do yourself a favour and try to get in the mindset of test driven development. it might seem like serious overkill for a first project ("why would i need a test for something that just returns a url string? waste of time!"), but i find writing tests up front helps me reason about the flow of an application a lot easier than banging out a bunch of code that just does it first and trying to make it sane later.

Any guides/tutorials/guidelines I can look into on that? What I've done so far is just test every minor thing I'm trying to build as a separate standalone app with static and/or test data. That, and print every variable/output/whatever I'm trying to do to confirm it does what I want it to do.

tef posted:i'd recommend sqlite, but i'd also suggest maybe peewee, depending how much sql you want to pick up

I already found out about requests, it sure makes my life easier. I'll look into lxml and peewee, but my main goal is to learn Python, so I'm not sure if peewee will get a fair chance.

onionradish posted:You might take a look at vcr.py, which gives an easy way to automatically "record and play back" HTTP traffic so you can run your script/parser as much as you want without hitting the servers while you're testing or doing development on the parser.

I currently write the files to disk and use those files to run my parsing tests again. My parser is pretty straightforward at the moment: I get the first page, see how many pages there are and get those too. I'm not sure how vcr will help me do this more easily (without making it too complex for me). Looking at your code snippets confuses me, as I have no idea how to use them in my current code. I really am a beginner at the bottom level here.
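The write-to-disk approach described above amounts to a simple page cache, which does most of what vcr.py would here. A sketch of that idea, with `fetch` standing in for whatever does the real HTTP request (e.g. `lambda url: requests.get(url).text`); the demo stub below is invented:

```python
import os
import tempfile

def get_page(url, cache_path, fetch):
    """Return the page's HTML, fetching it at most once.

    On the first call the page is downloaded via `fetch` and saved to
    cache_path; later calls read the saved copy instead of hitting
    the server again.
    """
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return f.read()
    html = fetch(url)
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(html)
    return html

# Demo with a stub fetch function that counts how often it is called.
calls = []
def fake_fetch(url):
    calls.append(url)
    return "<html>demo</html>"

path = os.path.join(tempfile.mkdtemp(), "page1.html")
first = get_page("http://example.com/page1", path, fake_fetch)
second = get_page("http://example.com/page1", path, fake_fetch)
```

The second call never touches the network, so parsing code can be re-run as often as needed against the cached copy.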
|
# ¿ Sep 3, 2016 14:58 |
|
Not sure what's going wrong here. code:
if I change the last line to: code:
it does the trick. But I don't actually want to have files called something<name>. Any clues what I'm doing wrong? The error message I get is: code:
Of course I figured out what was wrong seconds after posting: the "something" is actually the subdirectory I'm opening the .data files from (/home/dir/project/data/htmlpage*.data). Without including the path, it can't find the files. I guess I just need to find out how to tar the files without the path data. LochNessMonster fucked around with this message at 21:17 on Sep 12, 2016 |
# ¿ Sep 12, 2016 21:13 |
|
Master_Odin posted:You could use os.chdir() to set your working path to "something" and then you could just use the name of the file. I fixed it with: tar.add(str("data/" + name), arcname=str(name))
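For reference, `arcname` is exactly the knob for storing a file in a tar without its directory prefix. A self-contained sketch (the `data/` layout mirrors the one described; the file name is made up):

```python
import os
import tarfile
import tempfile

# Hypothetical layout: .data files live in a "data/" subdirectory.
workdir = tempfile.mkdtemp()
datadir = os.path.join(workdir, "data")
os.makedirs(datadir)
with open(os.path.join(datadir, "htmlpage1.data"), "w") as f:
    f.write("<html></html>")

archive = os.path.join(workdir, "pages.tar")
with tarfile.open(archive, "w") as tar:
    for name in os.listdir(datadir):
        # arcname stores the file under its bare name,
        # dropping the "data/" path prefix
        tar.add(os.path.join(datadir, name), arcname=name)

with tarfile.open(archive) as tar:
    names = tar.getnames()
```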
|
# ¿ Sep 12, 2016 21:58 |
|
So my (newbie) webscraping project is actually making some progress. I'm scraping a site with motorcycles and would like to store them in a sqlite3 database, but am not sure how to proceed. I can get the information I want, which is the brand, type, mileage, year and the dealer who sells it. For each entry I scrape I put the values in variables. For now I just want to write them to disk or a database, but in the end I'd like to identify each one of them uniquely (unfortunately license plates are usually not listed, so I need to think of something for that...). As for structuring the data, I've been doing some reading but I can't really see what I should use to store this info. Do I use a list, tuple or dictionary? Ideally I'd parse the info for 100-200 items with 5 attributes each. Would it be a good idea to create lists in a list, where the order of the items relates to the value inside the list? Like the first item is brand, 2nd type, 3rd mileage, etc? And then write them to sqlite with list.pop or something? I'm really eager to hear how you approach questions like these, because I don't really know how to proceed.
|
# ¿ Sep 15, 2016 21:08 |
|
Thermopyle posted:Use a database. A database has tables. Each table is kind of like a spreadsheet. So you'd have a field (column in spreadsheet) for each attribute.

I'm definitely going to write the data to a database. What I'm struggling with is how to structure the data within my script before writing to disk. The program currently scrapes html pages that are saved to disk. Each page has 10 vehicles in it, so I loop through the divs parsing the values for brand/type/mileage/year/dealer. That's where I am right now. To get that data to a database I figured I should probably "store" the data inside the script for at least all vehicles on one page, or for all vehicles on all the pages. When I have parsed one page (and thus 10 vehicles) I write all of them to the database at once, instead of opening (and closing) a db connection for each iteration of the page loop(s). So I think I need to put the vehicle info in a list, tuple or dictionary before writing it to the database. I could be missing something obvious though.
|
# ¿ Sep 15, 2016 21:59 |
|
Tigren posted:That's exactly what dicts and lists are for. So you could have a list called list_of_motorcycles and each motorcycle could be a dict, which links keys to values. Once you have parsed all the information for one motorcycle, append that dictionary to the end of your list_of_motorcycles and move to the next one. Then, at the end, you've got a big ol' list of motorcycle dictionaries.

Thank you, this is what I had in mind as a concept, but I had no clue if it would be a good and/or efficient way of doing it. I also didn't know what type I should've used. Thanks for helping me figure this out! After I manage to do this, I can create a loop based on Suspicious Dish's sql query to write to a database.
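As a concrete sketch of that structure (the field names match the thread; the values and raw tuples are invented stand-ins for parsed HTML):

```python
# One dict per motorcycle, appended to a list as each ad is parsed.
list_of_motorcycles = []
for raw in [("Honda", "CB500", 24000, 1998, "Dealer A"),
            ("Yamaha", "XS650", 51000, 1980, "Dealer B")]:
    bike = {
        "brand": raw[0],
        "type": raw[1],
        "mileage": raw[2],
        "year": raw[3],
        "dealer": raw[4],
    }
    list_of_motorcycles.append(bike)
```

Unlike the positional lists-in-a-list idea, each value is labelled by its key, so `bike["mileage"]` is unambiguous no matter what order the attributes were parsed in.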
|
# ¿ Sep 15, 2016 23:05 |
|
Master_Odin posted:If all you're doing is taking the data from the form and not doing anything meaningful with it before insertion into the DB, you may as well just insert it into the DB immediately. There's no penalty for leaving the database connection open for the length of the scraping, especially since it's sqlite.

I'm really new to programming so I'm taking baby steps. In the near future I will be doing stuff with the data before inserting it. Good to know I could keep the connection open if I for some reason need to in the future though.
|
# ¿ Sep 16, 2016 01:03 |
|
Dex posted:you could also use sqlalchemy for this.

It looks a bit more complicated; is it something I should be able to do as a complete beginner? What are the pros/cons compared to the plain sqlite3 method?
|
# ¿ Sep 16, 2016 17:28 |
|
Thermopyle posted:Use plain sqlite for now.

Was just reading about sqlalchemy, and while it looks like overkill for now, I might indeed need it in the future. I'll start fiddling with sqlite for now. edit: I must say I'm starting to enjoy this a lot more than I thought I would. I once had a Java programming class with a horrible teacher (90% of the class failing, several years in a row) and thought programming sucks balls. A few years back I was trying to find a specific brand/type of motorcycle and was quite annoyed I had to manually check a lot of dealer sites to see if they had any, so I figured that'd be a nice project to start with. I now have a scraper/parser working that gets roughly 100 vehicles from one dealer who uses a standard web gallery, which I've seen several other dealers use as well. Hopefully I can get the data of another 10-20 dealers this way. I'm pretty sure I'll be running into a lot of design flaws in a while, but figuring those out seems like half the fun. LochNessMonster fucked around with this message at 19:59 on Sep 16, 2016 |
# ¿ Sep 16, 2016 17:41 |
|
I'm probably going to figure this out within 5 minutes of posting, but I've been struggling with inserting data into sqlite3 and I'm not quite sure why. code:
code:
|
# ¿ Sep 21, 2016 08:54 |
|
ArcticZombie posted:I've never used sqlite3 so I'm winging this a bit, but I don't think your problem is with sqlite3. Looping over a dict doesn't work like that. The for loop will loop over each key in the dict:

The test dictionary only contains one entry, but the real result set will have tens of entries; I just started out with a smaller data set to spot problems early on. Your example should get me up and running though. I'll try to write something that inserts just the test dictionary and then see how I can write a loop around it so I can repeat it for multiple entries.
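The distinction ArcticZombie is pointing at, in runnable form: iterating a dict directly yields its keys, while `.items()` yields key/value pairs (the example dict is invented):

```python
bike = {"brand": "Honda", "type": "CB500", "mileage": 24000}

keys_seen = []
for key in bike:          # iterating a dict yields its keys...
    keys_seen.append(key)

pairs = []
for key, value in bike.items():   # ...use .items() for key/value pairs
    pairs.append((key, value))
```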
|
# ¿ Sep 21, 2016 09:47 |
|
ArcticZombie posted:The real results will be a list of dicts, no? You'll loop over the list, not over the dicts.

Yeah, you're right. It's a list containing dictionaries. I'm really new to this, and still struggling a bit with the different types of data structures (and how to process them).
|
# ¿ Sep 21, 2016 11:34 |
|
Tigren posted:It looks like other people squared you away on your for loop. I wanted to show you this option for inserting from a dict as well. Instead of the question mark "parameterization", you can use named placeholders. Those names are the keys of your dict.

Both options seem like a pretty efficient way of doing things. I didn't get the for loop working yet, so I'll give this a try as well. Just for my understanding: the In/Out [number] is from your Python environment?
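A minimal sketch of the named-placeholder approach combined with `executemany`, which handles a whole list of dicts in one call (table and column names invented to match the thread):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE motorcycles (brand TEXT, type TEXT, mileage INTEGER)")

bikes = [
    {"brand": "Honda", "type": "CB500", "mileage": 24000},
    {"brand": "Yamaha", "type": "XS650", "mileage": 51000},
]

# Named placeholders (:brand etc.) are filled from the matching dict keys;
# executemany runs the statement once per dict in the list.
cur.executemany(
    "INSERT INTO motorcycles VALUES (:brand, :type, :mileage)", bikes)
conn.commit()
count = cur.execute("SELECT COUNT(*) FROM motorcycles").fetchone()[0]
```

(The In/Out [number] prompts in Tigren's snippet are from IPython; in a plain script you'd just run the statements as above.)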
|
# ¿ Sep 21, 2016 19:43 |
|
When using the for loops I keep getting the same error, no matter what I try. When running this: Python code:
code:
So I went and tried Tigren's suggestion on how to tackle this. It doesn't give any errors, but for some reason it doesn't really enter the values either. When I do a SELECT * FROM motorcyles; it only returns one bogus value I added manually. Python code:
LochNessMonster fucked around with this message at 10:19 on Sep 22, 2016 |
# ¿ Sep 22, 2016 09:54 |
|
Hammerite posted:Do you need to commit the transaction?

I feel pretty stupid now...
|
# ¿ Sep 22, 2016 11:29 |
|
Thanks for all the feedback. I managed to get it working Tigren's way. The code now looks like this: Python code:
For now it's still a lot of trial and error, and thinking about what I'm doing wrong, but when I started a few weeks back I knew nothing at all (compared to you guys I guess that's still the case); now I seem to find my way a bit. LochNessMonster fucked around with this message at 17:26 on Sep 22, 2016 |
# ¿ Sep 22, 2016 17:19 |
|
Eela6 posted:LochNess, glad to see you're enjoying python. It is by far my favorite language.

Thanks for the feedback. Could you explain what is actually wrong and how I could improve on that? I thought I'd catch the errors for that specific code and print them to stdout; that way I'd know what went wrong.
|
# ¿ Sep 23, 2016 12:28 |
|
Reading these explanations and those links clears things up a bit (or a lot, actually). The reason I initially did it this way was a) because I didn't know any better, and b) I had no idea which other (specific) Errors I was looking for. The reason I put it in there was because my program seemed to get all the input, put it in a dictionary and added each dictionary to the list, but it didn't write the data to the database. This actually happened due to several reasons, some of which I found out with except Exception. The reasons were: the database was in a different path than my program was looking in, the table name had a typo so the insert failed, and finally I didn't commit the changes. What I didn't realize is that I could possibly catch lots of other Exceptions. So I'm going to rewrite this, catching a sqlite3.DatabaseError for the connection and a sqlite3.IntegrityError to see if my insert is working properly (if it isn't, I don't want to pollute the database).
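A sketch of that plan: a `UNIQUE` constraint plus catching `sqlite3.IntegrityError` keeps duplicate inserts out of the database (the schema and the ad_id column are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# A UNIQUE constraint lets sqlite itself reject duplicate rows.
cur.execute("CREATE TABLE motorcycles (ad_id TEXT UNIQUE, brand TEXT)")
cur.execute("INSERT INTO motorcycles VALUES ('ad-1', 'Honda')")

duplicate_rejected = False
try:
    # Inserting the same ad_id again violates the UNIQUE constraint.
    cur.execute("INSERT INTO motorcycles VALUES ('ad-1', 'Honda')")
except sqlite3.IntegrityError:
    # The specific error we expect for a duplicate insert;
    # anything else still propagates with a full traceback.
    duplicate_rejected = True
conn.commit()
count = cur.execute("SELECT COUNT(*) FROM motorcycles").fetchone()[0]
```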
|
# ¿ Sep 23, 2016 16:01 |
|
SurgicalOntologist posted:I'll just add that if you're only printing out the error message, you're not getting any more information than if you had just let it hit. Eela6's suggestion is a good reason to catch exceptions while debugging, but for it to provide benefits you need to print out some other useful information after catching, e.g. the values of some variables that may be related to the error. Finally, as in Eela6's example, you should still re-raise the exception if you're only catching it for debugging purposes (that is, if you don't do some error-handling that will allow your program to continue regardless).

I still have to wrap my head around the raising and re-raising you guys keep mentioning. I did print out a lot of other variables (and sometimes type(var) for the bs4 variables) to get an idea of what was going wrong. I cleaned that part up before posting though.
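The re-raising pattern being described, in miniature: catch the exception, print the context you need, then `raise` with no arguments so the original traceback survives (the price format here is made up):

```python
def parse_price(text):
    """Parse a price string like '€ 2.500'; on failure, print
    debugging context and re-raise so the traceback isn't swallowed."""
    try:
        return int(text.replace("€", "").replace(".", "").strip())
    except ValueError:
        print("could not parse price from: %r" % text)
        raise  # bare raise re-raises the same exception after logging

price = parse_price("€ 2.500")
```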
|
# ¿ Sep 23, 2016 17:12 |
|
My web scraper project for scraping motorcycle dealer sites is coming along nicely. I've noticed that the large majority of dealers use about 4 different inventory modules. I'm currently focusing on one of them, and the scraping is coming along nicely. The parsing of the data is also working properly, but it appears that almost every dealer has some small modification in the way the divs or spans are named. For example, the vehicle price can be found in <div class=price>, <div class=vehiclePrice>, <div class=vehiclePrices> or even <span class=vehiclePrice>. So plenty of variations, and this isn't just the case for Price; it happens for Mileage and Year as well. What started as "oh well, I'll just create one parser per dealer" is now becoming a bit annoying, since 99% of the code for each dealer appears to be the same, but I have a hard time figuring out how to move away from one parser per dealer and turn this into a single parsing application. I was hoping anyone has an idea on how to solve this. There are so many variations floating around that I'm keeping a spreadsheet tracking which divs/spans each dealer keeps their attributes in. I'm sure there's a smarter solution for that, but I can't really come up with something. Python code:
|
# ¿ Oct 6, 2016 16:24 |
|
Extortionist posted:Why store that information in a spreadsheet instead of in a dictionary in the code or in the db, which the script could then refer to when doing the parsing?

Unfortunately the divs vary in position as well, and have a multitude of different parent divs, so regexes are out of the question. I could create some sort of dictionary with all the different options, and then keep track of which options work for each dealer. I really need to think this through because it appears to be a bit complicated. Sometimes I need to get a specific parent div, because the span class I'm looking for actually occurs multiple times in the div that contains all the vehicle info. I'm not even sure if my spreadsheet is 100% correct, so I guess I need to take inventory first. I quickly checked for Price, and that alone has 3 variations. code:
baka kaba posted:I'm not sure you can really get around it, if everyone's doing things slightly differently then the best you can do is be as smart as possible with your selector/parsing code. You could try and generalise and get some definitions that work on every page, but honestly that just makes things complicated when they inevitably change something and create some new cases you have to catch. Maybe try making a separate dict or whatever for each page, just to handle getting the set of elements, and write a common scraper that just uses the appropriate class to get the data it needs to work on

I'm pretty new to Python (this is my first project; I'm slowly but steadily expanding my knowledge) but I haven't looked at methods or classes yet. Could you explain what your code snippet does exactly? I'm a bit at a loss here... edit: to clarify, I see what you're doing, but I don't really understand how to apply it to my code. LochNessMonster fucked around with this message at 18:05 on Oct 6, 2016 |
# ¿ Oct 6, 2016 18:03 |
|
subway masturbator posted:A department of my job also works with parsing data scraped from disparate websites and they learned that even if the sites are similar or share some parsing code, you are better off treating each site as a special snowflake.

The dealers I'm currently looking at all use the same sort of off-the-shelf inventory software; they just configured it differently, apparently (or are working with slightly different versions). I just looked at all the scripts I have, and so far I have 20 dealers and in total 7 different variations. It's not as bad as I thought: there are 4 variations on Price, 2 on Mileage and 3 on Year. Would it be an idea to expand my dealerDict by adding flags for these variations, and then create variables with the values for each price, mileage and year flag? Python code:
|
# ¿ Oct 6, 2016 18:47 |
|
baka kaba posted:(sorry, I screwed up and didn't use dict[key] notation for pulling out the selector strings in the dictionary!)

Thanks for explaining, this makes a lot more sense to me now. CSS selectors seem like a good (better) solution than what I'm currently doing.

quote:The reason that's useful is you can create that dict for one site, with all the selectors you need for each piece of data. Then you can create another dict for another site, with its own special snowflake selectors, but using the same key names. That way your main code (that does the selecting) can use the exact same calls

I'm not sure if creating a dictionary per site is the way I want to go for now. I have 20 sites so far and 3 attributes I'd like to get, which each have 4, 2 and 5 options; in total there are only 7 variations so far. What I had in mind is adding a value for each of those 3 attributes (price/mileage/year) so I can refer to a specific function (is that the right name?) for each value. This part of your example I still don't understand. It might be the naming of the dictionaries that confuses me. The cool_bikes/rad_bikes dictionaries are specific searches, so one site can use the first one while another one could use the 2nd one, right? What happens next is the part where I get lost. For each site that has those options, you fill brand with the key of the value that the site matches. Am I lost now, or is this correct? If I'm still on track, that'd mean I'd have to do a for loop for each attribute I wanted to check, right? I'm sure it's my lack of understanding, but it seems a bit counterintuitive to me.

quote:I don't actually do much python so I'm trying not to push classes too hard, but those are what I'm used to. 
A class can be like a fancier dictionary that can hold properties like selector strings, it can provide methods that do more involved processing and return a value (like maybe some more involved logic to find an awkwardly defined element on the page), and you can make sure that different classes have the exact same interface so they'll always have a get_brand(html_page) method (or whatever) that you can call every time in your loop. But if you're not familiar with them you'll probably want to learn about them first - but it's the same idea in the end, code reuse and splitting off the specifics into separate components you can just plug in to your main code

I was thinking about creating methods and/or classes for the different options and tracking which class I should call for each dealer. That'd make creating additional options less of a hassle in the long run, I think. But since I'm not familiar with either, I really need to read up on and learn about those before actually doing something with them.
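A stripped-down version of the per-site selector-dict idea, with the BeautifulSoup call stubbed out so it's self-contained (the site names and selectors are invented; in real code `select_one` would be `soup.select_one`):

```python
# Per-site selector strings, keyed by the same field names everywhere.
# With BeautifulSoup these would be passed to soup.select_one(...).
SITE_SELECTORS = {
    "cool_bikes": {"price": "div.price",
                   "mileage": "div.milage",
                   "year": "span.year"},
    "rad_bikes": {"price": "span.vehiclePrice",
                  "mileage": "div.vehicleMilage",
                  "year": "div.vehicleYear"},
}

def extract(site, field, select_one):
    """Look up the right selector for this site/field and run it.
    `select_one` stands in for e.g. soup.select_one from BeautifulSoup."""
    selector = SITE_SELECTORS[site][field]
    return select_one(selector)

# Stub page for illustration: maps selectors straight to values.
fake_page = {"div.price": "1500", "span.vehiclePrice": "2500"}
price = extract("rad_bikes", "price", fake_page.get)
```

The common scraping loop only ever calls `extract(site, field, ...)`; adding a 21st dealer means adding one dict entry, not another parser script.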
|
# ¿ Oct 6, 2016 20:15 |
|
baka kaba posted:Well I don't completely get the deal with the combinations of attributes or whatever, so it's hard to visualise exactly what you're pulling out and how it can vary. cool_bikes is just a bunch of info specific to the Cool Bikes site, with the specific selector queries you need for each bit of data you're scraping. So while you're scraping that site, you refer to that dictionary for all the info you need.

I can imagine it's difficult to understand, as you can't see the html source I'm trying to parse. I (now) understand the general idea you were showing me and I think I found a way to make it work, so thanks for that!
|
# ¿ Oct 7, 2016 11:00 |
|
Another (likely dumb) question about something I don't quite understand. I figured I'd take the following route for parsing different variations of a specific webstore template. Each dealer has a specific variation of div/span classes in which the price, mileage and year info is stored. Since there are only a few options per value I'd like to scrape, I figured I'd write a loop that checks which dealer uses which variation. I store which dealer site uses which variation in a dictionary with the dealer name as key and the values in a tuple. Python code:
Python code:
fake edit: while I'm typing this I'm realizing my error. If I don't wrap the tuple in quotes, and remove the quotes from the if flags[x] == "value" comparison, it treats the tuple as a tuple instead of a string...
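For the record, the working shape looks something like this: the per-dealer value is a real tuple (no surrounding quotes), and each position indexes into a lookup of selector variants (dealer names and variant labels are invented):

```python
# Per-dealer variation flags: (price_variant, mileage_variant, year_variant).
# Without quotes around the parentheses this is a tuple;
# "('p3', 'm2', 'y1')" with quotes would just be one long string.
dealers = {
    "dealer_a": ("p1", "m1", "y2"),
    "dealer_b": ("p3", "m2", "y1"),
}

# Which selector each price variant corresponds to.
price_selectors = {"p1": "div.price", "p3": "span.vehiclePrice"}

flags = dealers["dealer_b"]      # a tuple, not a string
price_variant = flags[0]         # index into the tuple by position
selector = price_selectors[price_variant]
```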
|
# ¿ Oct 15, 2016 14:16 |
|
So now I've built my scraper/parser and put the data in a sqlite3 db, but now comes the presentation. What's the easiest way to make a website that queries the db to show the data? Can I do that with Python too, or should I look into JavaScript for that?
|
# ¿ Oct 16, 2016 10:17 |
|
Thermopyle posted:Ahh, well if I knew you wanted to make a website from the data I would have started you on Django from the start.

When I started I didn't have a real idea what to do; I'm just trying to learn and think about new things to do when I've finished something. I was reading about Django and Flask a bit earlier and I was leaning towards Django, as it appears to be a large framework which might suit additional features I might think about later. One of the things I'd like to do is price tracking, and visualizing that with a graphing framework or something. Flask seems to be in really early development (version 0.10). Does that matter?
|
# ¿ Oct 16, 2016 17:41 |
|
Django sounds awesome, but it might be overkill for a newbie like me. Will I be able to add other functionality (like creating graphs) to a Flask web application later on?
|
# ¿ Oct 16, 2016 20:04 |
|
Ok great to hear. Flask it is! I noticed BigRedDot post some stuff about Bokeh in this thread, and that actually gave me the idea. It looks awesome and I can't wait to try it out, but first Flask.
|
# ¿ Oct 16, 2016 22:02 |
|
BigRedDot posted:Let me know if I can answer any questions I will, thanks!
|
# ¿ Oct 17, 2016 18:42 |
|
Dr Monkeysee posted:If you really wanted to expand your tech exposure you could expose your db as an HTTP web service (in which case I'd recommend Flask and the plugin Flask-RESTful) and build the website client-side with React or Angular or something. I wouldn't necessarily recommend this over a server-side web site but it's a different approach that is widely used in industry today.

I was considering this when looking into the differences between Django and Flask. And making each piece of functionality a microservice (sorry for the buzzwords) actually sounds like a great idea, since I have no clue what I'll be doing next. It's just a hobby project I started and keep expanding, so making components small and (relatively) easy to change sounds like a good idea. That said, I'll first try to get a default Flask application running. After that I'm going to experiment with a RESTful API. When I'm ready I think I'll head over to the front end thread and start a religious war by asking whether to use React or Angular.
|
# ¿ Oct 18, 2016 08:07 |
|
I'm trying to generate a page for each dealer in my sqlite db, which kinda works, but when trying to display all vehicles for that dealer I'm running into some issues. I think my 2nd sql query isn't picking up the dealer name. And even if it did, I'm not sure it'd actually return all vehicles for each page. I'm kinda stuck on how to figure this out. Python code:
|
# ¿ Nov 2, 2016 20:21 |
|
Tigren posted:Your string substitution is out of place.

I'll read up on that, thanks. I have done several tutorials and didn't come across string substitution yet, so I've been googling that part. I'm indeed using Python 3, so that's probably my first issue. I noticed you mentioned my for loop overwriting the variable with each iteration, but you changed your reply so it's gone. That was a relevant remark as well though; I'm kinda ashamed I didn't notice it myself.
|
# ¿ Nov 2, 2016 21:24 |
|
Symbolic Butt posted:Just a heads up for good programming practices, don't do string interpolation on sql queries:

My query_db function is based on the following functions, so I am using cursor.execute, just calling it through another function. For some reason I just can't get it to work. Python code:
code:
edit: maybe I should clarify what I'm trying to achieve a bit better. I currently have 40-ish dealers. For each dealer I'd like to have a url ../dealers/<dealerName> which displays all vehicles for that dealer. I still have a feeling I'm doing this completely wrong, but can't figure out what. edit2: I managed to get it working with (dealer) instead of [dealer] Python code:
And it displays all vehicles/dealers on each page, but that's something I need to fix in the jinja template. LochNessMonster fucked around with this message at 10:51 on Nov 3, 2016 |
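The fix boils down to passing the value as a one-element tuple so `execute` fills the `?` placeholder itself, never interpolating it into the SQL string. With Flask stripped away, the core looks like this (table and data invented; in the real app `dealer` would come from the `/dealers/<dealerName>` route):

```python
import sqlite3

def query_db(conn, query, args=()):
    """Run a parameterized query; args fills the ? placeholders safely."""
    cur = conn.execute(query, args)
    rows = cur.fetchall()
    cur.close()
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vehicles (dealer TEXT, brand TEXT)")
conn.executemany("INSERT INTO vehicles VALUES (?, ?)",
                 [("Dealer A", "Honda"), ("Dealer B", "Yamaha")])
conn.commit()

# In a Flask view this would sit under @app.route('/dealers/<dealer>');
# note the one-element tuple (dealer,) for the single placeholder.
dealer = "Dealer A"
rows = query_db(conn, "SELECT brand FROM vehicles WHERE dealer = ?", (dealer,))
```

Filtering in the WHERE clause also means each dealer page only receives its own vehicles, rather than filtering all of them in the Jinja template.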
# ¿ Nov 3, 2016 08:26 |
|
HardDiskD posted:I'm also quite new to Docker but in my experience, yeah, you really don't exactly need to ensure that the container's Python environment is clear, after all that's what the Docker container is for.

I've only used venv once so I probably shouldn't say anything about it. But isn't the whole point of venv to have a virtualized environment for one specific version of Python and specific packages? That's what you would pack into one container. If you need to run different applications with different Python/package versions, you'd normally do that in another container. I'm really new to programming in general so I might be completely off here. The idea behind Docker is to create microservices and put each functionality/program/service in its own container, with the containers talking to each other.
|
# ¿ Dec 2, 2016 21:49 |
|
Feral Integral posted:Can somebody tell me what I'm doing wrong with this simple re?

I assume you would like to get: ['asdhfjksdaf', 'sadcgfdsfg']

What you just typed was the 2nd hit twice, so I assume that's a copy/paste error. Did you try the string without the ^ and/or $? Your regex currently looks for a string that:
- starts with any character
- contains/captures all characters between square brackets
- ends with any characters

Unfortunately I'm phone posting so I can't check it myself. What you could try is something like this too: code:
Edit: beaten like a red headed stepchild.
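Since the original snippet isn't shown, here's a guess at a working version of the advice above: drop the `^`/`$` anchors and use a non-greedy capture group with `findall` (the input string is invented to match the expected output):

```python
import re

# Hypothetical input echoing the post: two bracketed tokens in one string.
text = "x[asdhfjksdaf]y[sadcgfdsfg]z"

# findall returns every match of the capture group; the non-greedy .+?
# stops at the first closing bracket instead of swallowing both tokens.
matches = re.findall(r"\[(.+?)\]", text)
```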
|
# ¿ Dec 7, 2016 19:34 |
|
I'm still playing with requests and bs4 and am running into an issue I'm not sure how to solve. The HTML page contains something like this: HTML code:
Is there an easy way to let my python app run the specific JavaScript so I can scrape the date/time?
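One lightweight option before reaching for a full browser driver like Selenium: the value the JavaScript displays is often already embedded in the page source, so a regex over the inline script can pull it out without executing anything. A sketch under that assumption (the script contents and function name here are invented):

```python
import re

# Hypothetical inline script resembling the one described; with
# requests/bs4 this string would be the fetched page source.
html = """
<script>
  showDateTime("2017-01-25 19:55");
</script>
"""

match = re.search(r'showDateTime\("([^"]+)"\)', html)
date_time = match.group(1) if match else None
```

If the value really is computed client-side (fetched by an XHR or calculated in the browser), this won't find it, and driving a real browser with Selenium is the usual fallback.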
|
# ¿ Jan 25, 2017 20:01 |