|
As another data guy, I can't really remember the last time I ran a regular Python script. Notebooks are good and cool and I use Papermill all the time to run them like a script. Even cooler is Julia's Pluto library, which is notebooks combined with Excel-like reactivity.
|
# ? Dec 19, 2020 16:36 |
|
I assumed notebooks were for sharing results, especially with interactive visuals.
|
# ? Dec 19, 2020 16:47 |
|
Dominoes posted:I assumed notebooks were for sharing results, especially with interactive visuals. When I share results I output the notebook to PDF. The few times I’ve needed to do this it’s always with nontechnical people.
|
# ? Dec 19, 2020 17:59 |
I use notebooks for research and repeatable tinkering, then just rewrite anything that's going into production in more traditional python format.
|
|
# ? Dec 19, 2020 18:09 |
|
a foolish pianist posted:I use notebooks for research and repeatable tinkering, then just rewrite anything that's going into production in more traditional python format. Same for web scrapers in selenium. Notebooks make it so much faster to iterate through the various steps.
|
# ? Dec 19, 2020 18:40 |
|
CarForumPoster posted:When I share results I output the notebook to PDF. The few times I’ve needed to do this it’s always with nontechnical people.

Yeah, the promise of notebooks being useful to share with non-technical people was a little oversold for me. Anyone non-technical isn't going to 1) install Python and associated libraries, 2) be able to run one of my notebooks, or 3) care about the code. I can make a PDF, but it doesn't really get me much when with just a little more work I can make a Shiny/Dash app and get actual interaction. I know there are services to kinda-sorta host them, but...

I've experimented a bit with using them to quickly prototype a bit of an overall analysis pipeline and then integrate it into a bigger Snakemake pipeline, and that worked well enough. Nothing groundbreaking, but it probably did save me some time, and if I need to dig in more I can pretty easily.

I use them extensively for loving around and experimenting, since having visualizations or big chunks of data pop out semi-formatted in a browser cell is (usually) more useful than having to write extra code for it in a REPL via SSH, or having to use a separate program or X server to let matplotlib poo poo appear over SSH. These days this usually goes hand-in-hand with using it to test out changes to an actual library that I'm writing in a real IDE. Problems with multiprocessing or async issues do pop up in certain contexts that cause me to have to go entirely out of the notebook environment, though.

I've also found myself extensively using Colab for teaching students, since neither they nor I have to set anything up and they can't possibly break anything that can't be fixed by hitting "Factory Reset Runtime". If I can get them to not overwrite the code cells that worked, it helps a little bit early on, because previous work doesn't "disappear" in their brains the way it seems to on the CLI or REPL.
|
# ? Dec 19, 2020 18:55 |
|
Do you have any recommendations for alternatives to Scrapy? The problem I'm trying to solve: I built out one Scrapy workflow that goes on a task queue, and given the Twisted reactor and the threading/singleton stuff I've run into, it's just obvious I'm really trying to "get around" a very opinionated framework. If this means I need to learn it better I'm willing to, but at this point I'm ready to consider my current implementation a throwaway prototype if there's something else. My requirements are just basic HTTP scraping, selection on XPaths, and populating/POSTing form data, with the ability to fire up Selenium if I need to. I'm typically saving entire documents instead of snippets.
|
# ? Dec 21, 2020 18:40 |
|
Hed posted:Do you any recommendations of alternatives to scrapy? I'm partway through converting our scrapers, which are all Python + Selenium running on one machine, to running entirely on AWS Lambda using headless Chrome orchestrated by an AWS Step Function. Haven't finished yet, but it's looking like it'll be highly flexible and trivially scheduled, AWS SAM makes deployments easy, and it lets us have some pretty complex business logic that needs to go along with the scrapers. It also costs almost nothing, because it's AWS Lambda and these things combined only have like 2 hrs of compute time per day. We're doing this ahead of building out a large number of scrapers and lead-enrichment functions that need to fail in easily tracked ways and be able to quit being enriched based on various logic. Let me know if I can help, happy to share what I've learned... it has been a steep learning curve trying to make these work reliably.
|
# ? Dec 21, 2020 21:32 |
|
Can you run headless Chrome inside a Lambda? I wrote a Selenium/Chrome container thing recently and I wouldn't know how to convert it to a Lambda.
|
# ? Dec 21, 2020 23:06 |
|
unpacked robinhood posted:Can you run headless chrome inside a lambda ? I can say for sure you can, and the SAM (lambci Docker container) emulates it decently well, so you can test locally before deploying. There's no conversion needed; really you just need a self-contained binary. That said, it gets to 250MB quickly, but I got around this by adding EFS to my Lambda. EDIT: I should caveat that by no "conversion" needed I mean that you can use your code mostly as written. Obviously you'll need to point to the Chrome executable and chromedriver; I put mine on EFS. You'll probably need to write some wrapper for the actual scraper code that is called by the lambda_handler, though. Also, logging will need some sort of treatment. The easy answer, if you append logs to a local file now, is to simply log to a file you append to in EFS. We haven't made a final decision on CloudWatch versus this log style yet. CarForumPoster fucked around with this message at 03:03 on Dec 22, 2020 |
# ? Dec 22, 2020 01:57 |
|
A bit late to notebook chat, but (as a probable Luddite) I exclusively use regular Python scripts. I love the argparse module. It takes a bit to get the scripts into a good spot, but once you do, it's real nice to just be like `python script.py do-thing resources=/var/path/to/resources` whenever you need to kick off something.
|
# ? Dec 22, 2020 18:07 |
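A minimal sketch of the argparse setup described above; the subcommand and path are the post's examples, but the script layout (and the use of a `--resources` flag rather than `key=value` syntax) is my own invention:

```python
import argparse

def build_parser():
    # One subcommand per task you might want to kick off
    parser = argparse.ArgumentParser(description="Kick off jobs from the command line.")
    sub = parser.add_subparsers(dest="command", required=True)
    do_thing = sub.add_parser("do-thing", help="Run the thing.")
    do_thing.add_argument("--resources", default="/var/path/to/resources",
                          help="Directory of input resources.")
    return parser

# Equivalent to running: python script.py do-thing --resources /var/path/to/resources
args = build_parser().parse_args(["do-thing", "--resources", "/var/path/to/resources"])
print(args.command, args.resources)
```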
|
Hed posted:Do you any recommendations of alternatives to scrapy? If you want to use simple HTML parsing for some pages and Selenium for other pages, that may make things a little more complicated, though.
|
# ? Dec 22, 2020 19:00 |
|
Re the notebook chat: it's always been something I've had in my hip pocket as a tool I know is available when the use comes up. I was surprised to hear (partly from articles critical of its non-linear execution) how it's often used in practice, i.e. for broader use cases.
|
# ? Dec 22, 2020 22:29 |
|
Dominoes posted:Re the notebook chat: It's always something I've had in my hip pocket as a tool I'll know is available when the use came up. I was surprised to hear (partly from articles critical of its non-linear execution) how it's often used in practice, ie for broader use cases. There are lots of situations where notebooks aren't as good (basically anything you could easily write in an IDE without access to runtime information).
|
# ? Dec 23, 2020 01:00 |
mystes posted:The "hidden state" issue isn't something you're normally going to run into as long as you're at all careful, and notebooks are very good for some applications. Actually the situation of web scraping someone was just asking about is a good example, because you want to keep the state around while you're working on it and it's much easier to poke around with web pages interactively, and it's a lot easier to organize that in a notebook than a REPL. Finally I have something to try notebooks for
|
|
# ? Dec 23, 2020 01:05 |
|
Bundy posted:Finally I have something to try notebooks for

I need to do a one-off scraping of a website every month or so, and Jupyter is the best for this by far. Here's an example workflow.

Requirement: For purposes of recruiting an attorney barred in states A and B who has experience in labor and employment and/or business litigation, get a list of all attorneys that State Bar B says are barred in states A and B.

How I use Jupyter to make this very fast: fire up Selenium, manually navigate to the bar's search page, search manually, then have a quick-n-dirty Selenium script grab the results from each page and put them in a DataFrame I dump to Excel.

I was able to get a list of probably every possible candidate in the nation that fulfilled the requirements, including very detailed other info like practice areas of law, what year they graduated law school, phone numbers, emails, where they're working now, etc. This is way better than any headhunter or hitting up LinkedIn could ever do, and it was a 2-hour job. I was able to find 40 potential candidates for some exceptionally rare job requirements in just a few hours.
|
# ? Dec 23, 2020 03:56 |
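The grab-results-from-each-page loop described above might look roughly like this. To keep the sketch self-contained and runnable, canned HTML strings and the stdlib html.parser stand in for Selenium's `driver.page_source`, and the page structure, class names, and fields are all invented:

```python
from html.parser import HTMLParser

# Stand-in for driver.page_source on each results page (structure invented)
PAGES = [
    '<table><tr class="result"><td>Jane Doe</td><td>jane@example.com</td></tr></table>',
    '<table><tr class="result"><td>John Roe</td><td>john@example.com</td></tr></table>',
]

class ResultParser(HTMLParser):
    """Collects the cell text of every <tr class="result"> row."""
    def __init__(self):
        super().__init__()
        self.in_result = False
        self.row = []
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and ("class", "result") in attrs:
            self.in_result = True
            self.row = []

    def handle_endtag(self, tag):
        if tag == "tr" and self.in_result:
            self.in_result = False
            self.rows.append(self.row)

    def handle_data(self, data):
        if self.in_result and data.strip():
            self.row.append(data.strip())

rows = []
for page in PAGES:  # with Selenium: navigate, then feed driver.page_source instead
    parser = ResultParser()
    parser.feed(page)
    rows.extend(parser.rows)

print(rows)  # rows of [name, email], ready to load into a DataFrame and dump to Excel
```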
|
I've only seen Jupyter being used when I did the MIT Intro to CS & Programming course. However, I have spent many days writing Python to scrape using Selenium. Are you saying that instead of having to re-run the script every time I wanted to try out a new element, I could have just used Jupyter instead? I used PyCharm's debug mode to write, but Jupyter seems like it would have been a lot simpler.
|
# ? Dec 23, 2020 09:21 |
|
Sad Panda posted:I've only seen Jupyter being used when I did the MIT Intro to CS & Programming course. However, I have spent many days writing Python to scrape using Selenium. Are you saying that instead of having to re-run the script every time I wanted to try out a new element I could have just used Jupyter instead? I used PyCharm's debug mode to write, but Jupyter seems like it would have been a lot simpler. Yep!
|
# ? Dec 23, 2020 11:06 |
CarForumPoster posted:a one-off...every month
|
|
# ? Dec 24, 2020 17:45 |
|
It's a different one-off every month. Some examples have been:

- All PACER court records for a particular plaintiff's attorney
- All State A records of attorneys barred in states A and B
- All State B records of attorneys barred in states A, B and C
- All ______ county records where the defendant is Liberty Mutual Insurance
- etc.

These come up about once per month, and I can bang out something good enough to be useful but not ready for production in about 2 hours thanks to Jupyter. I'm actually interviewing two people I found this way who, if they join, could represent a 25% increase in revenue over the next year or so. The business impacts of this system are pretty huge.
|
# ? Dec 24, 2020 18:09 |
|
Sad Panda posted:I've only seen Jupyter being used when I did the MIT Intro to CS & Programming course. However, I have spent many days writing Python to scrape using Selenium. Are you saying that instead of having to re-run the script every time I wanted to try out a new element I could have just used Jupyter instead? I used PyCharm's debug mode to write, but Jupyter seems like it would have been a lot simpler. You can use "scientific mode" in PyCharm and get the best of both worlds. Basically, you create cells using #%%, and then execute code stepwise, but still get all of the support that PyCharm provides, with all variables available for inspection in the variable viewer or whatever it is called.
|
# ? Dec 25, 2020 18:18 |
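A sketch of what such a cell-delimited script looks like; the `#%%` markers are what PyCharm's scientific mode uses to split a plain .py file into runnable cells, and the variables here are made up:

```python
#%% Load data (each "#%%" line starts a new runnable cell in scientific mode)
numbers = list(range(10))

#%% Transform: re-run just this cell after editing it; `numbers` stays in memory
squares = [n * n for n in numbers]

#%% Inspect: variables also show up in the variable viewer
print(squares[:3])
```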
|
For me, notebooks were the thing that got me through the slog of getting to at least the point where I can write simple actual programs. The immediate reactivity, and the ability to see what variables are doing step by step without having to use a debugger, was amazing. Even though I do most stuff in nano or Sublime Text now, when I'm dealing with a Python concept I'm not familiar with, I go to notebooks to gently caress around with it first.
|
# ? Dec 26, 2020 13:13 |
|
So I'm at the point where I'm finally putting together a real Python project, and I'm trying to understand package/module structure better. I'm working on an ML project where I want to test different pipelines by stringing together different combinations of data feeders, models, and trackers. Don't need to get into too much detail about this, but the way I've gone about structuring my package is as follows: code:
code:
code:
|
# ? Dec 26, 2020 18:48 |
|
Cyril Sneer posted:But I'm not sure how to do that without requiring that somehow everyone put their trackers into the same file. Maybe some __init__.py trickery can be helpful here? Yeah, that's basically it. I think there are different approaches, but what I do is ask, for every __init__.py, "is there anything in this package [folder with an __init__.py] that I want to be able to import from the level above?" and if the answer is yes, do this (using your example): Python code:
Python code:
SurgicalOntologist fucked around with this message at 19:52 on Dec 26, 2020 |
# ? Dec 26, 2020 19:48 |
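A self-contained sketch of the __init__.py re-export idea described above. The package and class names (mypackage, VGGModel) are invented, and the package is written to a temp directory only so the example runs end to end; in a real project you'd just have the two files on disk:

```python
# Layout being demonstrated:
#   mypackage/
#     __init__.py   -> from .models import VGGModel
#     models.py     -> defines VGGModel
import importlib, os, sys, tempfile

tmp = tempfile.mkdtemp()
pkg_dir = os.path.join(tmp, "mypackage")
os.makedirs(pkg_dir)

# mypackage/models.py defines the class...
with open(os.path.join(pkg_dir, "models.py"), "w") as f:
    f.write("class VGGModel:\n    name = 'vgg'\n")

# ...and mypackage/__init__.py re-exports it to the package level
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("from .models import VGGModel\n")

sys.path.insert(0, tmp)
importlib.invalidate_caches()  # the package was created after interpreter start

# Thanks to the re-export, callers don't need to know about the models submodule
from mypackage import VGGModel
print(VGGModel.name)
```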
|
SurgicalOntologist posted:
Also what I do.
|
# ? Dec 27, 2020 00:10 |
|
Great, thanks, that was helpful!
|
# ? Dec 27, 2020 01:24 |
|
Just a minor suggestion that it's also cool & good to avoid the import convenience shuffling and stick to slamming alt-enter in PyCharm as long as possible. Honestly, even then I'd still prefer something siloed off like "from butts.quickstart import Butt, Fart, poo poo"
|
# ? Dec 27, 2020 05:19 |
|
As a follow-up, is there a way to allow dynamic assignment of classes? In order to assemble my pipeline, I was thinking of using a configuration dictionary, with the keys specifying the particular model to use, amongst other things. I.e., something like: code:
code:
code:
|
# ? Dec 28, 2020 19:26 |
|
Look at the __import__ function.
|
# ? Dec 28, 2020 19:34 |
|
Cyril Sneer posted:As a follow up, is there a way to allow dynamic assignment of classes? In order to assemble my pipeline, I was thinking of using a configuration dictionary, with the keys specifying the particular model to use, amongst other things. You can literally do this; classes are objects too, and defining them with the class keyword binds a name to that class object. You can call them to call their constructors, too. code:
|
# ? Dec 28, 2020 19:38 |
|
You can also getattr the class name out of the module that the class is defined in. If you need to pull specific classes based on something like a text config file this works well. Assuming all of your models are present in the models module: Python code:
|
# ? Dec 28, 2020 19:40 |
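A minimal illustration of the getattr-on-a-module approach above; the stdlib `collections` module stands in for a hypothetical `models` module, and the class name is imagined to come from a text config file:

```python
import importlib

# Load the module that holds the classes (here 'collections' is a stand-in for 'models')
models = importlib.import_module("collections")

class_name = "Counter"            # e.g. read from a text config file
cls = getattr(models, class_name) # pull the class object out by name
instance = cls("abracadabra")     # and instantiate it as usual

print(instance.most_common(1))
```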
|
Phobeste posted:You can literally do this, classes are objects too and defining them with the class keyword creates an entry for that class object. Thanks, yes, I know this and probably shouldn't have used that example. I don't want the class object itself in the dictionary, but rather a descriptive entry for ease of use. So the user would just specify something like "vgg" or "resnet", and that would map to the appropriate class.
|
# ? Dec 28, 2020 20:32 |
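The string-to-class mapping described above can be a plain registry dictionary kept next to the class definitions; the model classes here are invented stand-ins:

```python
# Hypothetical stand-ins for real model classes
class VGG:
    def __init__(self, lr):
        self.lr = lr

class ResNet:
    def __init__(self, lr):
        self.lr = lr

# Descriptive names the user writes in the config, mapped to the class objects
MODEL_REGISTRY = {"vgg": VGG, "resnet": ResNet}

config = {"model": "resnet", "lr": 0.01}

# Look the class up by its descriptive name, then instantiate it
model_cls = MODEL_REGISTRY[config["model"]]
model = model_cls(lr=config["lr"])
print(type(model).__name__, model.lr)
```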
|
Cyril Sneer posted:As a follow up, is there a way to allow dynamic assignment of classes? In order to assemble my pipeline, I was thinking of using a configuration dictionary, with the keys specifying the particular model to use, amongst other things. I recommend dataclasses for configs, since they likely have a fixed set of settings (fields), and the values are of different types. Comparatively, dictionaries leave you open to errors that will surprise you at runtime. Use an Enum to define allowed classes that can be in the config: Python code:
Dominoes fucked around with this message at 21:16 on Dec 28, 2020 |
# ? Dec 28, 2020 20:59 |
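A sketch along the lines the post describes, since its original code block didn't survive; the names (SelectedModel, MODEL_A, predictor) follow the thread's later discussion, but everything else is invented:

```python
from dataclasses import dataclass
from enum import Enum, auto

class SelectedModel(Enum):
    """The closed set of models the config may select."""
    MODEL_A = auto()
    MODEL_B = auto()

@dataclass
class Config:
    predictor: SelectedModel  # only valid variants can go here
    lr: float = 0.01

cfg = Config(predictor=SelectedModel.MODEL_A)

# Dispatch on the enum variant when assembling the pipeline
if cfg.predictor is SelectedModel.MODEL_A:
    chosen = "building model A"
else:
    chosen = "building model B"
print(chosen)
```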
|
Interesting, thanks. I'm a bit confused by your example code though. Is MODEL_A an actual class definition (in which case how can that assignment work?), or something else?
|
# ? Dec 28, 2020 21:43 |
|
SelectedModel.MODEL_A is an enum variant. You can think of an enum as a way to list choices. (Or as a binary with more than 2 variants, and semantic meaning to each variant.) When processing your config, you run different code depending on which variant the predictor field holds. This way your IDE etc will only allow certain classes to be selected in the config. Enums allow you to specify only valid models. Your example used strings; presumably not every string (or every class) is a valid option in your config! If you plan to serialize your config, instead of using enum.auto(), specify an integer for each variant, so your serialization is consistent. Dominoes fucked around with this message at 22:00 on Dec 28, 2020 |
# ? Dec 28, 2020 21:53 |
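The serialization point above, sketched with invented names: fixing each variant's integer (instead of enum.auto()) means a value written to disk today still maps to the same variant after the enum grows new members:

```python
from enum import Enum

class SelectedModel(Enum):
    MODEL_A = 1   # fixed integers keep serialized configs stable
    MODEL_B = 2   # across code changes, unlike auto() numbering

# Serialize the variant as its integer value...
stored = SelectedModel.MODEL_A.value

# ...and recover the same variant later from that integer
restored = SelectedModel(stored)
print(stored, restored)
```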
|
|
I do something like this with a Discord bot I wrote, where I abuse importlib to hot-load/unload plugins without restarting the bot. Commands check the plugin list for an attribute that matches the command list in each plugin and fire off coroutines in parallel.
|
# ? Dec 29, 2020 04:06 |
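A minimal hot-reload demo in the spirit of the post above (the plugin module and its contents are invented, and it's written to a temp directory just so the example is self-contained): importlib.reload picks up edited plugin code without restarting the process.

```python
import importlib, os, sys, tempfile

sys.dont_write_bytecode = True  # force reload to recompile from source, not cached .pyc

tmp = tempfile.mkdtemp()
sys.path.insert(0, tmp)
plugin_path = os.path.join(tmp, "plugin.py")

# First version of the plugin on disk
with open(plugin_path, "w") as f:
    f.write("def command():\n    return 'v1'\n")

import plugin
first = plugin.command()

# Simulate editing the plugin while the process keeps running
with open(plugin_path, "w") as f:
    f.write("def command():\n    return 'v2'\n")

plugin = importlib.reload(plugin)  # re-executes the module's new source
second = plugin.command()
print(first, second)
```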
|
I understand the use of virtualenvs and such for keeping different projects' dependencies isolated from the system and each other, but if I actually do want the latest version of something installed system-wide, is sudo pip install actually the best way to go? My specific use case right now is youtube-dl, but there are a few other utilities written in Python that use pip as their official package manager that I'd generally want installed system-wide, with no need to activate environments before use.
|
# ? Dec 30, 2020 19:16 |
|
In Windows, pip install is fine. On Linux, you risk putting your system in a totalled state, because the OS relies on the Python install it comes with. Try this:

- Download the latest Python source from python.org
- Build and install it using `configure`, `make`, and `sudo make install` from the directory you unpacked it to
- Use its pip: `python3.9 -m pip install apackage`

This way, you don't risk modifying a package your OS relies on. Or roll the dice with your system Python's pip; it will probably be fine. Dominoes fucked around with this message at 19:39 on Dec 30, 2020 |
# ? Dec 30, 2020 19:36 |
|
Similarly, on OSX messing with the system Python will wreck your poo poo, just absolutely don't.
|
|
# ? Dec 30, 2020 19:37 |