|
Business posted:The subject of pickling came up earlier and I am new to the whole idea of serialization so I wanted to check if what I'm doing makes sense. I am working with spaCy for NLP and I wanted to save large Doc objects (spaCy thing that contains lots of NLP data for text(s)) on the backend of my Django app. Here's what I'm thinking: a user uploads a text and the backend processes it and saves it as a .pickle so they don't have to re-process the text every time they want to use it. When the user wants to do something to the previously uploaded text, I have the backend open up the saved pickle, search through it, and serve that data to the user. The pickle file seems gigantic to me (600 kb text file to 17 mb .pickle) so I'm worried I'm going about this the wrong way. https://intoli.com/blog/dangerous-pickles/
|
# ? May 30, 2019 15:32 |
|
This is super interesting but doesn't seem to be relevant to this dude's use case since the end user doesn't get the opportunity to upload their own pickles.
|
# ? May 30, 2019 15:54 |
|
You should be able to export the token data in a spacy Doc directly to numpy arrays and then use the usual numpy serialization methods.
|
# ? May 30, 2019 16:10 |
|
Is there a way to find the changelogs for packages that are updated but don't necessarily have anything useful? For example I'm using PyAutoGUI. 3 days ago 0.9.43 was released. If I look at the GitHub (https://github.com/asweigart/pyautogui/commits/master) the latest is 0.9.42 but PyPI (https://pypi.org/project/PyAutoGUI/0.9.43/#history) shows that 0.9.43 exists but I have no idea what changed. It doesn't necessarily have to be showdiff, it's just that I'm reluctant to upgrade without knowing what's happened given it's going into a semi-important project. Sad Panda fucked around with this message at 19:34 on May 30, 2019 |
# ? May 30, 2019 19:32 |
|
Business posted:The pickle file seems gigantic to me (600 kb text file to 17 mb .pickle) so I'm worried I'm going about this the wrong way. According to the docs that seems like it's to be expected. quote:When pickling spaCy’s objects like the Doc or the EntityRecognizer, keep in mind that they all require the shared Vocab (which includes the string to hash mappings, label schemes and optional vectors). This means that their pickled representations can become very large, especially if you have word vectors loaded, because it won’t only include the object itself, but also the entire shared vocab it depends on. Only thing I can think to check is if v2 has some new stuff that makes it easier/better (these are v1 docs I think?). It's not really clear to me whether the built-in to_bytes() / from_bytes() methods can apply to Doc objects or only pipelines or whatever (I have no experience in NLP so this is all Greek to me): https://spacy.io/usage/v2#migrating-saving-loading I would not store these in a database generally, but maybe some of the new object / document store databases would work? Not sure what they buy you over just saving to disk if you can't actually make use of the data (e.g. what does stuffing binary blobs into a db get you? I don't know, but check 'em out, there might be some benefit there).
|
# ? May 30, 2019 19:58 |
|
Sad Panda posted:Is there a way to find the changelogs for packages that are updated but don't necessarily have anything useful? For example I'm using PyAutoGUI. 3 days ago 0.9.43 was released. If I look at the GitHub (https://github.com/asweigart/pyautogui/commits/master) the latest is 0.9.42 but PyPI (https://pypi.org/project/PyAutoGUI/0.9.43/#history) shows that 0.9.43 exists but I have no idea what changed. I'd open an issue on the repo and ask why there are releases but no changes to the repo. Their changes document hasn't been updated since 0.9.40 either. They just released another one today, even. Your only option at this point is to diff the package contents.
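For what it's worth, diffing the package contents doesn't take much code. This is only a minimal sketch, assuming you've already downloaded both releases as zip-format archives (wheels are zips; sdists are usually .tar.gz and would need tarfile instead). The function names are made up for illustration.

```python
import hashlib
import zipfile


def archive_digests(path):
    """Map each file name inside a zip archive to a SHA-256 digest of its bytes."""
    with zipfile.ZipFile(path) as archive:
        return {
            name: hashlib.sha256(archive.read(name)).hexdigest()
            for name in archive.namelist()
            if not name.endswith("/")  # skip directory entries
        }


def compare_archives(old_path, new_path):
    """Return (added, removed, changed) file names between two archives."""
    old = archive_digests(old_path)
    new = archive_digests(new_path)
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(
        name for name in old.keys() & new.keys() if old[name] != new[name]
    )
    return added, removed, changed
```

This only tells you which files differ. Once you know that, you can extract the changed ones and run them through difflib or a regular diff tool. pip download with --no-deps is one way to fetch a specific release without installing it.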
|
# ? May 30, 2019 20:28 |
|
Thanks for the responses all, they are helpful. Going to look into storing as numpy arrays and/or being more selective about what I store. The main issue is that my texts are large enough that I have to stream them as a chopped up list of Docs, and then find a way to store that and the associated vocab data without resorting to writing it to bytes. The blobs would give me everything, and they're good for my own offline purposes, but it's helpful to have it confirmed that it's weird to store a big binary data thing for each file.
|
# ? May 30, 2019 21:54 |
|
I worked out my pip problem. Sort of. Using --no-use-pep517 gets me past the issue. It's concerning that the same problem exists on multiple devices.
|
# ? May 30, 2019 21:59 |
|
Business posted:Thanks for the responses all, they are helpful. Going to look into storing as numpy arrays and/or being more selective about what I store. The main issue is that my texts are large enough that I have to stream them as a chopped up list of Docs, and then find a way to store that and the associated vocab data without resorting to writing it to bytes. The blobs would give me everything, and they're good for my own offline purposes, but it's helpful to have it confirmed that it's weird to store a big binary data thing for each file. Preprocessing texts in spaCy shouldn't be that compute intensive, so consider just storing the raw texts and computing on demand, especially if users are given the option to upload new texts. Trade CPU time for storage: even with numpy, I suspect you're going to be dealing with large objects since there's a lot of token level metadata.
|
# ? May 31, 2019 00:50 |
|
shrike82 posted:Preprocessing texts in spacy shouldn't be that compute intensive so consider just storing the raw texts and computing on demand especially if users are given the option to upload new texts. I imagine this is the way you'd want to go because spaCy is taking text and adding a whole lot of extra data to it by its very nature. Depending on what you're up to you could also split the difference and store the original text and whatever the relevant processed/aggregated results are but not the entire marked up data from spaCy.
|
# ? May 31, 2019 13:47 |
|
Is it awful practice to wrap import errors in a generic ‘Make sure you’re using the right virtual environment’ type reminder?
|
# ? May 31, 2019 14:08 |
|
the yeti posted:Is it awful practice to wrap import errors in a generic ‘Make sure you’re using the right virtual environment’ type reminder? As someone who is often in the wrong virtual environment, yes, because that's sort of implied by the error. For anyone who uses a virtual environment, i.e. non-new users, that would be the first thing you check.
|
# ? May 31, 2019 15:44 |
|
I think if you do something like code:
Also, for anybody experienced with using python this type of message would be completely unneeded as they'd all know what to do/check, so it'd just be for beginners. Boris Galerkin fucked around with this message at 16:28 on May 31, 2019 |
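A minimal sketch of that kind of wrapper, for anyone following along: catch the ImportError and exit with a hint instead of a bare traceback. The helper name `require` and the exact wording of the message are made up for illustration.

```python
import importlib


def require(module_name):
    """Import a module by name, or exit with a hint about virtual environments."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Hypothetical message; word it for whoever will actually read it.
        raise SystemExit(
            f"Could not import {module_name!r} ({exc}). "
            "Make sure you're using the right virtual environment."
        ) from exc


json = require("json")  # stdlib module, so this import succeeds
```

Raising SystemExit with a string prints the message to stderr and exits with a non-zero status, which is probably friendlier for people who don't read tracebacks. Keeping the original exception text in the message means nothing is hidden from the people who do.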
# ? May 31, 2019 16:24 |
|
shrike82 posted:Preprocessing texts in spacy shouldn't be that compute intensive so consider just storing the raw texts and computing on demand especially if users are given the option to upload new texts. From testing things out on my own it seems like everything up through part of speech tagging is no problem, but dependencies and NER take a long time. About 25 seconds on my laptop, which has 8 gigs of RAM, for a more or less typical use case I have in mind. That seems prohibitively expensive in terms of cloud CPU time but I have no clue, I was just working under the assumption that storage is way cheaper than computation. Sorry this is getting kinda far afield but cool to see other people are using spaCy.
|
# ? May 31, 2019 16:46 |
|
Boris Galerkin posted:I think if you do something like Yeah that’s what I had in mind, I’m dealing with DBAs who don’t know much/any python but will run the stuff I’m writing now and then to check data going into their db.
|
# ? May 31, 2019 17:07 |
|
the yeti posted:Yeah that’s what I had in mind, I’m dealing with DBAs who don’t know much/any python but will run the stuff I’m writing now and then to check data going into their db. Yeah, if you know your audience you should write for your audience.
|
# ? May 31, 2019 18:57 |
|
Trying to learn Flask & Python to do what I need it to do... right now I'm trying to get an integer from an HTML number form and times it by a set amount (in this example 12) and then return the result to display on the HTML page... basically ask the user how many tickets they want and then times it by the ticket cost. I know I'm doing this wrong anyways but I think if I can get this one thing working I can expand out and do what I actually want to accomplish. The problem I'm having though is when I run the application, I don't get x * 12 as a result, I get the integer x printed 12 times... so not sure what I'm doing wrong. Any way one of you could take a look and give a hint or something? I tried to turn the input aduTix into an int to times by twelve but it returns an internal server error. I don't really understand what I'm doing but I can kinda understand the code I'm writing. Thank you all for the help! EDIT: The internal server error I get says "int" or "float" or "str" aren't supported. EDIT2: I figured it out! Flask doesn't like to return an int, it has to be a string or something else I believe. Here's my html: code:
code:
Empress Brosephine fucked around with this message at 20:21 on May 31, 2019 |
# ? May 31, 2019 19:39 |
|
Empress Brosephine posted:Trying to learn Flask & Python to do what I need it to do... right now I'm trying to get an integer from an HTML number form and times it by a set amount (in this example 12) and then return the result to display on the HTML page... basically ask the user how many tickets they want and then times it by the ticket cost. I know I'm doing this wrong anyways but I think if I can get this one thing working I can expand out and do what I actually want to accomplish. The problem I'm having though is when I run the application, I don't get x * 12 as a result, I get the integer x printed 12 times... so not sure what I'm doing wrong. Any way one of you could take a look and give a hint or something? I tried to turn the input aduTix into an int to times by twelve but it returns an internal server error. I don't really understand what I'm doing but I can kinda understand the code I'm writing. Thank you all for the help! This should help explain why it wasn't working. http://flask.pocoo.org/docs/1.0/quickstart/#about-responses
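To spell out what was probably going on: form values arrive as strings, and multiplying a string by 12 repeats it instead of doing arithmetic. Here's a minimal sketch with the Flask parts left out. `ticket_total` is a made-up helper, and in a view you'd presumably call it with something like request.form["aduTix"].

```python
TICKET_PRICE = 12  # the example price from the post


def ticket_total(raw_quantity):
    """Turn a form value such as "3" into the text of a response body."""
    quantity = int(raw_quantity)     # form fields arrive as strings
    total = quantity * TICKET_PRICE  # now this is arithmetic, not repetition
    return str(total)                # hand Flask a str rather than a bare int


print("3" * 12)           # string repetition: 333333333333
print(ticket_total("3"))  # 36
```

Note that int() raises ValueError on empty or non-numeric input, so a real view would want to handle that case rather than let it turn into another internal server error.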
|
# ? May 31, 2019 20:27 |
|
Thanks for the help with Flask. Another quick question, is it possible to take data input from a form like a date and then insert it into an a href such as E: nvm, keep answering my own questions. Empress Brosephine fucked around with this message at 17:44 on Jun 1, 2019 |
# ? Jun 1, 2019 17:36 |
|
Hey dudes. Is it possible to set mobile conditions from a Django template? I'm not v good with them and forms, but I have a form for file upload that's in a template since I'm not sure how to do it on frontend. Want to hide it if on mobile. Normally this is done by checking window.innerWidth, but not sure how to do in a template or Django form. Answer might just be to move to frontend.
|
# ? Jun 1, 2019 17:42 |
|
Dominoes posted:Hey dudes. Is it possible to set mobile conditions from a Django template? I'm not v good with them and forms, but I have a form for file upload that's in a template since I'm not sure how to do it on frontend. Want to hide it if on mobile. Normally this is done by checking window.innerWidth, but not sure how to do in a template or Django form. Answer might just be to move to frontend. You can parse request.META['HTTP_USER_AGENT'] and then see if it's a mobile browser. There are various libraries to help with this. Personally, I avoid doing this kinda stuff. Instead I'd use media queries in my CSS.
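If you do go the user agent route, a rough sketch of the check might look like this. It assumes the common convention that mobile browsers put "Mobi" somewhere in the User-Agent string, which is a heuristic and not a guarantee. `looks_mobile` is a made-up name.

```python
def looks_mobile(user_agent):
    """Rough heuristic: treat any User-Agent containing "mobi" as a mobile browser."""
    return "mobi" in user_agent.lower()


# Sample User-Agent strings, used here only to exercise the function.
IPHONE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Mobile/15E148 Safari/604.1"
)
DESKTOP_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0"

print(looks_mobile(IPHONE_UA))   # True
print(looks_mobile(DESKTOP_UA))  # False
```

In a Django view you could pass looks_mobile(request.META.get("HTTP_USER_AGENT", "")) into the template context and wrap the upload form in an {% if %}. User agents can be spoofed or missing, which is one more reason the CSS media query approach is the safer default.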
|
# ? Jun 1, 2019 17:48 |
|
I'd like a method to run once a month, on a machine that's randomly on (I'd prefer to keep it full Python if I can, so no cron etc). It's not important that it runs every 30 days, or every 1st of the month. If the main script hasn't been executed for more than a month it should run the task once, and keep track for next time. I've done a quick test with schedule but I don't think it's what I need. What are my options? I can whip up a lovely thing by writing a timestamp to a file but maybe there's a better way.
|
# ? Jun 2, 2019 18:18 |
|
unpacked robinhood posted:I'd like a method to run once a month, on a machine that's randomly on (I'd prefer to keep it full Python if I can, so no cron etc). If the machine might reboot at unpredictable times I don't see how you can guarantee anything without touching the external environment. Even if you write your own custom logic to keep track of how long to wait between runs you still need to somehow tell the system to turn the thing doing the waiting back on when it reboots, and that is going to necessarily involve configuring something at the system level (i.e. not "full Python"). I would just make a cron job that runs once per day, and have the script check if it's been at least a month since it was last run before doing anything.
|
# ? Jun 2, 2019 18:37 |
|
Definitely use cron
|
# ? Jun 2, 2019 19:03 |
|
QuarkJets posted:Definitely use cron Nippashish posted:I would just make a cron job that runs once per day, and have the script check if it's been at least a month since it was last run before doing anything. The script will run at least once when the machine is on, probably at boot time, or from user input. I'll take a look at cron then. Any libraries to deal with the second part? I've written a thing but I don't really trust myself with reliability and edge cases.
|
# ? Jun 2, 2019 19:14 |
unpacked robinhood posted:The script will run at least once when the machine is on, probably at boot time, or from user input. I'll take a look at cron then. I would just write the month of execution into a text file and, on each run, check whether that value has changed since last time or the file doesn't exist yet.
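Something like this, as a minimal sketch. The function name and the idea of passing the task in as a callable are just illustrative choices, and it assumes "once per calendar month" is good enough, not "every 30 days".

```python
from datetime import date
from pathlib import Path


def run_if_new_month(stamp_file, task, today=None):
    """Run task() unless the stamp file says it already ran this month."""
    today = today or date.today()
    current = today.strftime("%Y-%m")  # include the year so 2019-06 != 2020-06
    path = Path(stamp_file)
    previous = path.read_text().strip() if path.exists() else None
    if previous == current:
        return False  # already ran this month, nothing to do
    task()
    path.write_text(current)  # recorded only after the task finishes without raising
    return True
```

Writing the stamp after the task means a crash midway gets retried on the next run. If the task must never run twice in a month, write the stamp first instead. The optional today argument is only there so the logic can be tested without waiting a month.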
|
|
# ? Jun 2, 2019 19:29 |
|
cinci zoo sniper posted:I would just write the month of execution into a text file and, on each run, check whether that value has changed since last time or the file doesn't exist yet. I like this better than dicking around with timestamps, thanks
|
# ? Jun 2, 2019 19:38 |
|
Right now I have a variable that is a request.form value from an HTML date picker that assigns a value of XXXX-XX-XX to the variable. How would I write an if statement that says if the variable is between let’s say 2019-01-01 and 2019-05-31, is that possible? I’m not sure if Python or myself is smart enough to know that the variable is an integer and not just a random string. Right now I have about a 50 int long “or” statement looking for certain dates to trigger a fail. Thanks for the help.
|
# ? Jun 2, 2019 20:34 |
|
The datetime module has problems I don't totally understand (mostly having to do with time zone shenanigans, I think?) but the way I'd do it is turn the XXXX-XX-XX string into a date object (using fromisoformat) and do your comparisons between those.
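Roughly like this, as a sketch rather than anything definitive. The range is the one from the question, `in_range` is a made-up name, and date.fromisoformat needs Python 3.7 or newer.

```python
from datetime import date

START = date(2019, 1, 1)
END = date(2019, 5, 31)


def in_range(picked):
    """Check whether a YYYY-MM-DD string falls within START..END, inclusive."""
    chosen = date.fromisoformat(picked)  # raises ValueError on malformed input
    return START <= chosen <= END


print(in_range("2019-03-15"))  # True
print(in_range("2019-06-01"))  # False
```

Date objects compare chronologically, so the chained comparison replaces the long "or" statement. Since this is plain Python it doesn't care where the string came from, so a value pulled out of request.form should work the same way.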
|
# ? Jun 2, 2019 20:51 |
|
Empress Brosephine posted:Right now I have a variable that is a request.form value from an HTML date picker that assigns a value of XXXX-XX-XX to the variable. How would I write an if statement that says if the variable is between let’s say 2019-01-01 and 2019-05-31, is that possible? I’m not sure if Python or myself is smart enough to know that the variable is an integer and not just a random string. Right now I have about a 50 int long “or” statement looking for certain dates to trigger a fail. Python code:
|
# ? Jun 2, 2019 20:56 |
|
Empress Brosephine posted:Right now I have a variable that is a request.form value from an HTML date picker that assigns a value of XXXX-XX-XX to the variable. How would I write an if statement that says if the variable is between let’s say 2019-01-01 and 2019-05-31, is that possible? I’m not sure if Python or myself is smart enough to know that the variable is an integer and not just a random string. Right now I have about a 50 int long “or” statement looking for certain dates to trigger a fail. Python code:
|
# ? Jun 2, 2019 20:56 |
|
unpacked robinhood posted:
I had never heard of pendulum until now and have definitely been bitten by datetime issues or written overly complicated code for that kind of bs. Absolutely fantastic.
|
# ? Jun 2, 2019 21:30 |
Pendulum owns tbh
|
|
# ? Jun 2, 2019 23:12 |
|
Thank you all so much. Will this work with Flask?
|
# ? Jun 2, 2019 23:14 |
|
Python code:
Thermopyle posted:You can parse request.META['HTTP_USER_AGENT'] and then see if it's a mobile browser. There are various libraries to help with this.
|
# ? Jun 3, 2019 00:20 |
|
Liking the Jetson nano, but the software dev team needs to get their poo poo together. Some stuff is pre installed with their Jetpack SDK, some via other means. What is making GBS threads me is opencv. I don't know what they did but it's a part of the SDK. Trouble is it doesn't appear in any venvs. Can someone point me in the right direction of how to deal with this, if at all possible?
|
# ? Jun 3, 2019 09:10 |
|
Is that poetry package manager the pendulum people have any good?
|
# ? Jun 3, 2019 14:28 |
|
It's like if pipenv wasn't written by an rear end in a top hat, so yes
|
# ? Jun 3, 2019 16:48 |
|
Malcolm XML posted:It's like if pipenv wasn't written by an rear end in a top hat, so yes I figured at least one response would be along that line. In a practical sense pipenv is really pissing me off because packages keep breaking it (pendulum does) and the excuse is always vendored dependencies.
|
# ? Jun 3, 2019 17:09 |
|
So I finished Python Crash Course and loved it; what should I read next to improve my skills? Is it worth learning more than the blade level of skills with Flask? Thanks all.
|
# ? Jun 4, 2019 01:48 |