|
I have no idea where to ask this, so apologies in advance if this is completely not the place. I've built a tool in Python to scrape a certain wiki for a certain game. The goal here is to bring a bunch of pages down locally and transform the HTML to Markdown files to be read in with Obsidian. Obsidian has this incredibly cool graph view that shows pages and their relationships visually, which is really the whole point of this project: to see sort of a bird's eye view of everything and maybe make mental connections between things that aren't immediately obvious.

I'm scraping about 3000 pages, bringing down one smallish image per page, totaling about 130MB. I'm caching images so I'm not re-downloading them every time I scrape a page for text updates. I'd like to make this tool publicly available, to give content creators (or anybody really) a starting point for their own note-taking and theorycrafting for said game. Starting with an Obsidian Vault containing all enemies/locations/NPCs, etc., already populated and linked would be helpful to people, I think (it has been for me, anyway).

But I'm concerned about the legality of, and my own liability in, releasing such a tool. This wiki is owned and operated by a Chinese company that runs a whole host of game-specific wikis. Their Terms of Service prohibit scraping without express written consent. In my own research it looks like web scraping is not illegal, but I could be liable for breach of contract of their terms of service. It's one thing for me to be scraping their website a couple times a day while I've been developing this thing, but it's another to release this tool into the wild and have more people hitting their servers for 130MB's worth of bandwidth a pop. I'm not making any assumptions about how (un)popular this tool would be, it's pretty niche, but I'm concerned all the same.
One idea I had was to omit any explicit reference to the wiki, and have the first line of a config file be WIKI_URL= with the value blank, and give a little wink wink nudge nudge in the readme for the user to fill in the blank with their favorite wiki for this game. Still, seems dicey. Am I opening up a can of worms here that should be left unopened?
|
# ? Nov 29, 2022 19:31 |
|
|
There might be a few things you can do, and I guess it depends on how it’s licensed too. Making a HEAD request and checking the Last-Modified header will let you know if yours is stale, if the header is accurate. There might also be a page like https://en.m.wikipedia.org/wiki/Special:RecentChanges?hidebots=1&hidecategorization=1&hideWikibase=1&limit=50&days=7&urlversion=2 where you can collect only the pages changed since the last run. They might have the Semantic MediaWiki extension https://en.m.wikipedia.org/wiki/Semantic_MediaWiki which is used by liquipedia, but I never figured out how to use it. I believe if they use the creative commons license, you can rehost your compiled copy of it without anyone else needing to run it.

e: also I’m pretty sure the worst they can do is ban you, but you probably want to avoid that still. If you space it out enough there’s no way they’ll care about it; it’s likely background noise even if others start running it. It’s still a good idea to make it as efficient as you can though, because that’s good for them, other users, and also you.

dougdrums fucked around with this message at 04:10 on Nov 30, 2022 |
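The HEAD-request idea can be sketched like this — the header parsing is all stdlib; the commented `requests.head` call and the URL variable are illustrative, not from the actual tool:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def page_is_stale(last_modified_header: str, cached_at: datetime) -> bool:
    """True if the server's copy is newer than the copy we cached."""
    remote = parsedate_to_datetime(last_modified_header)  # RFC 2822 date -> aware datetime
    return remote > cached_at

# In the scraper you'd feed it a real header, e.g. (hypothetical names):
#   resp = requests.head(wiki_url + page_path, timeout=10)
#   stale = page_is_stale(resp.headers["Last-Modified"], cached_at)
```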
# ? Nov 30, 2022 04:01 |
|
I don't think there are any issues with writing the code or making it publicly available under an open source license. Make the url an input argument, and use a config file to store the last-used input as the default for the next invocation. I am not a lawyer; this is my opinion and not legal advice.
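A minimal sketch of that arrangement — the config file name and JSON key here are made up:

```python
import argparse
import json
from pathlib import Path

def parse_args(argv=None, cfg_path=Path("scraper_config.json")):
    # The last-used URL becomes the default for the next run.
    last_url = None
    if cfg_path.exists():
        last_url = json.loads(cfg_path.read_text()).get("wiki_url")

    parser = argparse.ArgumentParser()
    parser.add_argument("wiki_url", nargs="?", default=last_url,
                        help="base URL of the wiki to scrape")
    args = parser.parse_args(argv)
    if args.wiki_url is None:
        parser.error("no wiki URL given and none remembered from a previous run")

    # Remember the choice for next time.
    cfg_path.write_text(json.dumps({"wiki_url": args.wiki_url}))
    return args
```

That way the repo never mentions the wiki by name; the user supplies it once and the tool remembers it.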
|
# ? Nov 30, 2022 04:40 |
|
Hey our attorneys have handled this before back in 2019. I am not a lawyer and this is not legal advice. Generally scraping websites as though you’re a human is permissible. Whether there’s civil liability depends on a lot of things potentially including whether your actions cause financial damage to the company. Whether the ToS is binding is an “it depends” as well. Forming an LLC can be under $200 and may indemnify you from the worries. If they sue they’re suing an LLC with no money. CarForumPoster fucked around with this message at 05:27 on Nov 30, 2022 |
# ? Nov 30, 2022 04:50 |
|
Fwiw, I always put an email address in the user agent, and I’ve only been asked to stop once out of all the times I made things like this. I wrote a client for predictit’s “undocumented” api some years ago. I never released it or used it to make more calls than the site itself, so I take it they just didn’t want people using automated tools.
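For what that's worth, the polite-scraper header looks something like this — the address and UA string are placeholders:

```python
import urllib.request

# A contact address in the User-Agent gives the site's admins a way to
# reach you before they reach for the ban button.
UA = "wiki-scraper/0.1 (contact: somebody@example.com)"

req = urllib.request.Request(
    "https://example.com/wiki/Some_Page",
    headers={"User-Agent": UA},
)
# urllib.request.urlopen(req) would send it; building the Request is enough here.
```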
|
# ? Nov 30, 2022 05:35 |
|
Is it just me or is Pipenv fragile as gently caress? Although in my curmudgeonly way I had to be dragged into it kicking and screaming, once I stopped and learned it, I love the whole idea. It's a pretty good way of specifying a repeatable Python environment for dev and deploy. But good loving grief is it temperamental. I'll have something that works great on dev, and then when deployed it just explodes in a ball of python errors.
|
# ? Dec 1, 2022 12:12 |
|
I thought everyone had moved on to Poetry by now? Not that I know if it's any better...
|
# ? Dec 2, 2022 05:01 |
|
Thanks for the responses above! Some very good ideas here on saving bandwidth I hadn't considered.

CarForumPoster posted:Forming an LLC can be under $200 and may indemnify you from the worries. If they sue they’re suing an LLC with no money.

I'm thinking this might not be a bad thing to do regardless, and something I've been meaning to do at some point anyway if I maybe want to consider making some money off of these dumb little projects of mine.
|
# ? Dec 2, 2022 20:58 |
|
Am I wrong for wanting to implement immutable objects as frozen dataclasses with cached properties? For example: Python code:
(and yes, I know nothing is truly immutable as you can call object.__setattr__ if you really want to, I'm just trying to avoid having single-underscore attributes or methods)
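A minimal version of that pattern, with illustrative class and field names:

```python
import math
from dataclasses import dataclass
from functools import cached_property

@dataclass(frozen=True)
class Circle:
    radius: float

    @cached_property
    def area(self) -> float:
        # Computed on first access, then stored in the instance __dict__;
        # cached_property writes there directly, so frozen=True doesn't object.
        return math.pi * self.radius ** 2
```

One caveat: this only works while instances have a `__dict__`, so it's incompatible with `slots=True` on the dataclass.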
|
# ? Dec 3, 2022 01:26 |
|
I think you're correct.
|
# ? Dec 3, 2022 04:15 |
|
I did not know about cached_property. That looks awesome and I can’t wait to simplify some of my code with it.
|
# ? Dec 3, 2022 04:26 |
|
Zugzwang posted:I did not know about cached_property. That looks awesome and I can’t wait to simplify some of my code with it. Oh yeah that is rad.
|
# ? Dec 3, 2022 22:18 |
|
So since the OP is pretty old, what's considered the best tutorial/course these days for learning Python for someone who's never programmed before?
|
# ? Dec 4, 2022 04:28 |
|
mila kunis posted:So since the OP is pretty old, what's considered the best tutorial/course for learning python for someone who's never programmed before these days?

I personally recommend https://automatetheboringstuff.com/ - it's available free online but he also has a course on Udemy and a book. There's no magic bullet though, so it might not work for you, but for me it was the thing that finally made software dev click into place.
|
# ? Dec 4, 2022 07:28 |
|
+1 for ATBS. It’s not just a good intro to the language; it immediately shows you how to do truly useful things with it. (It’s also free!) Al’s follow-up book, Beyond the Basic Stuff With Python, is also great. It’s about how to put together nontrivial, maintainable programs once you know the ins and outs of the language.
|
# ? Dec 4, 2022 07:42 |
|
Replit.com has a 100 Days of Python course which talks you through many key concepts and applies them in scenarios. It's free and has videos explaining it.
|
# ? Dec 4, 2022 08:05 |
|
qsvui posted:I thought everyone moved onto Poetry now? Not that I know if it's any better...

Oh god. Why are people creating even more package managers? Are people REALLY pining for the toolchain hell that is javascript THAT much? PIP works! (Oh and I suppose there's Conda. The maths people seem to like that one for some reason.)
|
# ? Dec 6, 2022 03:13 |
|
I think I have like 4 different python environment managers on my computer.
|
# ? Dec 6, 2022 03:29 |
|
One day there’s gonna be a docker image for every possible python package and I’ll have 4 different tools on my computer to manage that.
|
# ? Dec 6, 2022 03:31 |
|
I just install everything into base and use only the packages that win the ensuing melee.
|
# ? Dec 6, 2022 03:36 |
|
duck monster posted:PIP works!
|
# ? Dec 6, 2022 03:52 |
|
duck monster posted:Oh god. Why are people creating even more package managers. Are people REALLY pining for the toolchain hell that is javascript THAT much?

Pip works well for pure python packages; conda works well if you also have non-python dependencies. The Intel MKL is a faster backend for numpy, sometimes much faster, and while there are instructions for compiling numpy from source with the MKL (which you of course installed separately), gently caress actually doing that, and quadruple-gently caress doing that in Windows lmao
|
# ? Dec 6, 2022 10:20 |
|
QuarkJets posted:Pip works well for pure python packages, conda works well if you also have non-python dependencies. The Intel MKL is a faster binding for numpy, sometimes much faster, and while there are instructions for compiling numpy from source with the mkl (which you of course installed separately) gently caress actually doing that, and quadruple-gently caress doing that in Windows lmao Yeah I get that with Conda. All those hosed up nvidia and intel binary blobs that need to be rustled. Or f*#?king GDAL lol. I just wish Pipenv was fit for purpose. I like the idea. Its just.... kind of broken
|
# ? Dec 6, 2022 13:34 |
|
duck monster posted:Yeah I get that with Conda. All those hosed up nvidia and intel binary blobs that need to be rustled. Or f*#?king GDAL lol.

Yea sometimes there’s a conda install x that works much better than a similar pip install x thing when you need binaries. When you end up deploying a docker image anyway, it can be a big help. IIRC at various times this has been the only way to successfully install one of these on at least one OS: Selenium, FastAI, scikit-learn, PyTorch
|
# ? Dec 6, 2022 18:22 |
|
I guess numpy is just drat slow. In Advent of Code 2022/08 I saved the data to a numpy 2d array. When I iterated through the array, two python for loops were like twice as fast as using the numpy.ndarray iterator. I began to wonder whether more numpy functions are this slow too. I timed the parts of my program:

time to do slices: 4.49991ms
time for silver: 63.13467ms
time for gold: 51.50747ms
time total: 119.14206ms

For silver I was doing this around 40000 times: code:
code:
time for silver: 39.77084ms
time for gold: 40.23886ms
time total: 85.01887ms

... I even tried running the whole solution 100 times with timeit.
With python's max(): 9040.51970ms
With numpy.max(): 12312.64960ms

Edit: I got the same with numpy.amax(lf_slice); numpy.nanmax() was quite a bit slower.

Edit2: I guess numpy needs a lot more data than a 10k cell 2d array. Welp. At least it's so easy to access slices compared to python's 2d array with lists.

Ihmemies fucked around with this message at 18:25 on Dec 8, 2022 |
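The max() comparison can be reproduced with something like this (assuming numpy is installed; the grid is a stand-in for the AoC input, and exact timings will vary):

```python
import timeit
import numpy as np

grid = np.arange(9801).reshape(99, 99)  # roughly a 10k-cell AoC-style grid
row = grid[42, :]                       # one short slice
as_list = row.tolist()

# On tiny inputs, per-call overhead dominates: np.max goes through ufunc
# dispatch machinery, while max() over a plain list is a tight C loop.
t_builtin = timeit.timeit(lambda: max(as_list), number=10_000)
t_numpy = timeit.timeit(lambda: np.max(row), number=10_000)
print(f"builtin max: {t_builtin:.4f}s  np.max: {t_numpy:.4f}s")
```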
# ? Dec 8, 2022 17:38 |
|
Apologies for the bunches of text incoming. This is my first foray into pandas' "to_sql" function. I have a dataframe full of metadata pulled from Salesforce. I'm trying to push this dataframe to MySQL (my manager wants to use MySQL because it's free, of course). The dataframe looks ok when I inspect it, and the to_sql connection string works ok, but I get the following error:

quote:ProgrammingError: (pymysql.err.ProgrammingError) (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '), 0, 0, 0, NULL, (), NULL, NULL, 0, 0, 0, 0, 'tns:ID', 1, 'id', 0, 0, 0),(1, 0,' at line 1")

Of course the full error is a big ol' text dump if you're morbidly curious, or think it will help: https://www.klgrth.io/paste/8795j

The python code is below. The salesforce credentials are from a randomly generated instance populated with fake data, so you're free to try the code below or pull it into a dataframe or whatever: code:
Also I tried other mysql modules (like mysqldb) to no avail.
|
# ? Dec 9, 2022 04:03 |
|
Seventh Arrow posted:Apologies for the bunches of text incoming. This is my first foray into pandas' "to_sql" function.

I didn’t read but iirc % text when you’re trying to do %s screws things up. Iirc a workaround is to use a text() function. I’m posting from bed so can’t Google much more than that
|
# ? Dec 9, 2022 04:41 |
|
I don't really know the answer, but since you're just using a canned dataframe method I'd guess that there's some issue in your data. I'm not about to go digging through that huge one-line dump of it, so good luck with that.

You can help isolate the problem by just inserting one row at a time, and if it breaks on the very first row then try selecting a specific range of columns (say the first 10 of them).
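A sketch of that row-at-a-time bisection — the frame, table name, and sqlite URL here are stand-ins (you'd keep your `mysql+pymysql://…` engine and real dataframe):

```python
import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame({"Id": [1, 2], "Label": ["fine", "als)o, (tricky"]})
engine = create_engine("sqlite://")  # stand-in for mysql+pymysql://user:pw@host/db

# Insert one row at a time so the first failing record identifies itself.
for i in range(len(df)):
    try:
        df.iloc[i : i + 1].to_sql("sf_metadata", con=engine,
                                  if_exists="append", index=False)
    except Exception as exc:
        print(f"row {i} blew up: {exc}")
        break
```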
|
# ? Dec 9, 2022 06:10 |
|
CarForumPoster posted:I didn’t read but iirc % text when you’re trying to do %s screws things up. Iirc a work around is to use a text() function. I’m posting from bed so can’t Google much more than that

FYI OP this is a red herring; it'd only matter if you were doing your own insertions with pymysql statements instead of passing a connector to the dataframe (which is also something you can try)
|
# ? Dec 9, 2022 06:11 |
|
Also the ‘connectionless commits’ in sqlAlchemy (your df.to_sql(con=engine,…)) are deprecated; you should do this as: code:
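The non-deprecated shape of that call is roughly the following — sqlite stands in for MySQL here so the sketch runs anywhere, and the frame and table name are made up:

```python
import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame({"Id": [1, 2], "Name": ["a", "b"]})
engine = create_engine("sqlite://")  # would be mysql+pymysql://... in the real pipeline

# Hand to_sql an explicit connection inside a transaction block; the commit
# happens when the with-block exits cleanly, and a rollback on error.
with engine.begin() as conn:
    df.to_sql("accounts", con=conn, if_exists="replace", index=False)
```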
DoctorTristan fucked around with this message at 07:07 on Dec 9, 2022 |
# ? Dec 9, 2022 06:54 |
|
numpy sucks, pandas sucks, gotta use it but it'll break so who tf knows. don't use mysql ever, replace with postgres, everything will work fine then
|
# ? Dec 9, 2022 06:56 |
|
oh yeah and point out to your manager that postgres is also free
|
# ? Dec 9, 2022 07:24 |
|
Ihmemies posted:I guess numpy is just drat slow. In Advent of Code 2022/08 I saved data to a numpy 2d array. When I iterated through the array, two python for loops were like twice as fast than using numpy.ndarray() iterator. I began to think if more numpy functions are so slow too.

The biggest problem with numpy is that it is very eager to create temporary arrays. Every time that you think that you're doing some operation in-place, numpy is probably creating a temporary array behind the scenes to store intermediate results. So while numpy is great for writing code quickly, it's better to write your own vectorized functions with numba. But in this case it sounds like you're spending a lot of time in a python for loop, which is really the performance bottleneck here. Definitely push that into a numba kernel.
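The temporary-array point is easy to see with `out=` — this sketch avoids the hidden intermediates that `a = a * 2 + 1` would allocate (assuming numpy; a numba `@njit` kernel would go further by fusing the whole thing into one loop):

```python
import numpy as np

a = np.arange(5, dtype=np.float64)

# a * 2 + 1 builds one temporary array for (a * 2) and another for the (+ 1).
# Chaining in-place ufunc calls reuses a's own buffer instead.
np.multiply(a, 2, out=a)
np.add(a, 1, out=a)
```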
|
# ? Dec 9, 2022 07:33 |
|
QuarkJets posted:I don't really know the answer, but since you're just using a canned dataframe method I'd guess that there's some issue in your data. I'm not about to go digging through that huge one-line dump of it so good luck with that

This would be time-consuming but it's definitely a good idea. I think there are ways to filter out rows in pandas directly, so I will look into that - thanks! But I highly suspect that it may have something to do with something said ˅˅˅˅

DoctorTristan posted:Also the ‘connectionless commits’ in sqlAlchemy (your df.to_sql(con=engine,…)) are deprecated, you should do this as

Yes I think you're right - to give a bit of background, there are some columns where the (meta)data is supposed to represent a multi-choice field. These have square and round brackets and are separated by commas, and from what little I know about database design, it's a disaster waiting to happen. Maybe the best option is to filter out all of the special characters and have them just separated by spaces. Or maybe push them to a separate table.

DELETE CASCADE posted:numpy sucks, pandas sucks, gotta use it but it'll break so who tf knows

QuarkJets posted:oh yeah and point out to your manager that postgres is also free

noted, I will test it out and see. Thanks everyone!
|
# ? Dec 9, 2022 08:15 |
|
Also as a very general rule I’d avoid using pandas for data pipeline work like this because the ‘helpful’ inferring of column data types will frequently gently caress you over in exactly this way. Generally the safer way is to upload the data as text into a staging table and then do type conversions on the database side. (The best method is to use proper ELT tools, but from the sound of things that isn’t an option)
|
# ? Dec 9, 2022 11:09 |
|
OP I think your observations here are the payoff from the big recommendations a few weeks ago: that you should carefully inspect the data so that you understand what it contains. After a few posters suggested that this is a data problem you were able to point out that there's a multiple-choice field that contains bracketed options, and you were able to deduce that maybe that's the real cause of the issue. I think that's probably right, since the error is complaining about parens in an SQL statement that pandas created from your data.

So what should you do about this - do you really need this column in the database? It sounds like these are options shown to a user in a UI maybe; that doesn't sound useful for a database, but I have no idea whether you may actually need it. Drop that column if you can. Ideally you'd only query for the smallest-necessary number of columns; right now you have 57 columns and I'm guessing you don't need to put all of those in a database table, but maybe I'm wrong.

DoctorTristan is right to point out that there are pitfalls in using pandas. If there's more info you can extract from the Salesforce API then you may be able to start catching these issues before they occur - deviations in data types are especially troublesome (e.g. maybe there's a field that is normally an integer but sometimes returns a string or null or some other poo poo).
|
# ? Dec 9, 2022 11:31 |
|
DoctorTristan posted:Also as a very general rule I’d avoid using pandas for data pipeline work like this because the ‘helpful’ inferring of column data types will frequently gently caress you over in exactly this way. Generally the safer way is to upload the data as text into a staging table and then do type conversions on the database side.

Thanks for the advice - and yes, I think a lot of this boils down to the fact that I'm having to do a lot of stuff as a data engineer that I haven't before. It might be possible to look into different ETL solutions, but they're not pushing a lot of data so something like Hadoop seems like overkill. I'll see how things go, however! I'm wondering now if pyspark has the same pitfalls as pandas, then.

QuarkJets posted:OP I think your observations here are the payoff from the big recommendations a few weeks ago: that you should carefully inspect the data so that you understand what it contains. After a few posters suggested that this is a data problem you were able to point out that there's a multiple-choice field that contains bracketed options, and you were able to deduce that maybe that's the real cause of the issue. I think that's probably right, since the error is complaining about parens in an SQL statement that pandas created from your data.

Thanks also, I'm learning what I can. TBH, even if my manager didn't suggest using python for this pipeline, I would have suggested it anyway because it's familiar to me and it seems like it can "do everything." But now that I'm learning about the weaknesses of pandas, maybe not!

The columns are needed. The idea is that they want an always-up-to-date set of metadata in a database. There are stored procedures (which I will need to write!) that will check the client's data against this metadata (i.e., does it exceed the allowed number of characters, are the country codes correct, etc.), and checking if it has the right "picklist" values is part of that.
I think it will still work if all the values are just separated by a space and with no brackets. Ultimately it's a welcome challenge though...in my last position I didn't get to use programming at all, I'm glad that I'm at least getting to use this stuff in a practical way.
|
# ? Dec 9, 2022 14:54 |
|
I've got a (very possibly stupid) question on generators prompted by some discussion in the Coding Horrors thread: is there a way to conditionally change the order of yield statements in a loop?

The inspiring post: QuarkJets posted:(...)

The issue at hand is this: a colleague of mine has been getting familiar with a domain-specific package I work on as part of a bigger team. This package basically has Agents producing data from a number of Sources per agent at discrete time steps given some Inputs -- something like this: Python code:
code:
Any suggestions?
|
# ? Dec 12, 2022 17:54 |
|
The ordering in this example is somewhat arbitrary, so you could just make the time loop the innermost loop. If you wanted to give anyone the freedom of either style then you could split up the generator. There's no reason that this generator needs to accept "inputs" as an input argument, it can accept "input" instead. But don't name a variable "input" as that's a built-in
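A toy version of that split, with the per-timestep generator factored out so callers can interleave time steps however they like — every name here is a stand-in for the real package's:

```python
class Source:
    """Stand-in for the package's real Source type."""
    def __init__(self, name):
        self.name = name

    def read(self, agent, step_input):
        return f"{agent}:{self.name}:{step_input}"

def agent_outputs(agent, sources, step_input):
    """Yield each source's reading for one agent at one time step."""
    for source in sources:
        yield agent, source.name, source.read(agent, step_input)

def run(agents, sources, inputs):
    # Time loop outermost here, but a caller holding agent_outputs() is free
    # to make time the innermost loop instead -- the ordering is theirs.
    for step_input in inputs:
        for agent in agents:
            yield from agent_outputs(agent, sources, step_input)
```

Because `agent_outputs` takes a single `step_input`, neither ordering is baked into the inner generator.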
|
# ? Dec 12, 2022 18:41 |
|
|
Anyone else using PyCharm on a MacBook? I'm doing the 100 Days of Python off of Udemy and am working thru the course in PyCharm. I've got a directory set up that has individual folders for each day. Typically each morning I go into Finder & just create a new folder (say, "Day 35"). Then I open PyCharm, select "New Project", and then navigate to that day's folder. Inevitably, when I click "Okay" on the pop-up dialog where you choose the interpreter & a few other things, it returns another pop-up telling me that the folder isn't empty & I just select "Create Project From Source". Then the project loads and the empty Icon file opens in the editor. Is there any way to ignore this file or otherwise keep it from opening immediately upon each new project creation?
|
|
# ? Dec 12, 2022 19:05 |