|
Thermopyle posted:I came across this article when Googling for something unrelated and I found it semi-interesting. This is good to know. Also, hi thread. I am learning Python to do some web scraping and data manipulation and eventually machine learning stuff. Holy crap is it easy. I got all the data from a webpage into a CSV with like 4 lines of code (pandas) and there's 19 bajillion examples of how to do this online. CarForumPoster fucked around with this message at 11:43 on Oct 6, 2017 |
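For the curious, those four lines were basically pandas' built-in table reader. This is a from-memory sketch with a placeholder URL, not my exact script: code:
import pandas as pd

# placeholder URL; read_html grabs every <table> element on the page
url = "https://example.com/page-with-a-table"
tables = pd.read_html(url)       # returns a list of DataFrames, one per table
tables[0].to_csv("scraped.csv", index=False)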
# ¿ Oct 6, 2017 11:19 |
|
Hughmoris posted:I'm working my way through Automate The Boring Stuff and am on the Web Scraping section. Just curious, with your pandas example, are you scraping full tables or are you using selectors to nab individual items and then building a dataframe? I just started yesterday so right now I nab everything and dump it into a CSV. Also, is there a machine learning thread? I'm wondering if I could actually skip this step altogether. I have the problem that similar data like names, dates, and events are stored in tables on lots of webpages, in semi-different formats on each website. I want to scrape that data and have ~machine learning~ (or whatever) sort it out for me such that it is stored in a way I can add to a database (CSV or something like that). I have ~50 sample webpages I could format how I'd like (desired output) and use to train a model with, and could easily get a few hundred if it was worthwhile. I know I'd need a way to vectorize either the table info or the html altogether... haven't figured that out yet. Machine learning thread? CarForumPoster fucked around with this message at 17:29 on Oct 6, 2017 |
# ¿ Oct 6, 2017 17:25 |
|
Cingulate posted:You can try the data science thread, or the stats thread. Your understanding is correct. I want a web scraper more flexible than most (all?) web scrapers, to scrape a certain type of data and put it in a database (dates and info usually stored in tables on many different webpage layouts). I.e. I want a flexible data formatting tool. I want to use machine learning to do some data wrangling.
|
# ¿ Oct 6, 2017 21:33 |
|
Dominoes posted:Does anything in the OP look stale? The Heroku link directs you to a login rather than info about Heroku.
|
# ¿ Oct 8, 2017 14:05 |
|
FAGGY CLAUSE posted:Not sure this is the exact thread for it, but I'm using Python to build out a prototype. I don't have a lot of experience but also have this exact problem so I'm curious what you find out. Why no cloud services though?
|
# ¿ Dec 3, 2017 13:21 |
|
FAGGY CLAUSE posted:Classified documents. Are you this guy: https://www.youtube.com/watch?v=h6TRYcx74qs
|
# ¿ Dec 3, 2017 14:47 |
|
Sad Panda posted:Very. I'll be logging in to each site and there's not going to be much standardisation. I do exactly this with a different, easier, non-python solution. For what industry and purpose are you doing this project?
|
# ¿ May 20, 2018 14:00 |
|
Sad Panda posted:It's personal use, no form of industry. I'm logging into betting websites. What's the easier option? I considered a PyAutoGUI setup which is just recording a series of clicks and using OCR + image recognition, but Selenium seemed like an interesting way to start. Some learning curve to this, decent documentation, and $150/month, but many good features. It was much more robust and allowed faster deployment than trying anything with BeautifulSoup and the like. https://www.parsehub.com/ EDIT: It's free to use with a limited feature set and whatnot, but I ended up needing several of their not-free features. E.g. I don't want my projects to become public. CarForumPoster fucked around with this message at 18:36 on May 20, 2018 |
# ¿ May 20, 2018 18:28 |
|
Short question: When you guys write Python code, do you write a function under a def and then, under if __name__ == "__main__":, call the function you just wrote? Reason: I've been tinkering with Python for a while now but am finally working on a little project (automated GUI/web interaction using Selenium or PyAutoGUI to automate writing and uploading test reports) that needs me to write a bunch of little functions to call with a wrapper based on what I need to do that day. I wrote my code, then when I was done, indented it, wrote def funcname():, and added an if __name__ == "__main__": block that calls the function. I feel like I should do this for everything I write. Do you guys do it this way?
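To be concrete, here's the shape I mean (the function name is made up): code:
def do_the_thing():
    # placeholder for whatever the script actually does
    print("doing the thing")

if __name__ == "__main__":
    do_the_thing()
That way the function still runs when I execute the file directly, but importing the file from somewhere else doesn't trigger it.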
|
# ¿ Jun 7, 2018 23:05 |
|
I have a dumb newbie question but I feel like I'm missing something obvious. I have a function to make QR codes: code:
df = pd.read_excel("data.xlsx"), giving: code:
What do I need to be doing?
|
# ¿ Jun 10, 2018 00:49 |
|
huhu posted:I don't see a loop anywhere? I can't be sure because I've not used the functions you're discussing but I think you're assuming something is iterating when it's not. I thought df.apply was supposed to act like a loop. EDIT: Thanks for the help. I got it! code:
CarForumPoster fucked around with this message at 01:18 on Jun 10, 2018 |
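For anyone searching later, the working version looked roughly like this (the column name and the qrcode usage are from memory, so treat it as a sketch): code:
import pandas as pd
import qrcode  # pip install qrcode[pil]

# stand-in for the spreadsheet; the real thing came from pd.read_excel("data.xlsx")
df = pd.DataFrame({"SerialNo": ["A001", "A002", "A003"]})

def make_qr(row):
    # "SerialNo" is a made-up column name for this sketch
    img = qrcode.make(str(row["SerialNo"]))
    img.save(f"qr_{row['SerialNo']}.png")

df.apply(make_qr, axis=1)  # axis=1 hands each row to the function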
# ¿ Jun 10, 2018 01:03 |
|
SurgicalOntologist posted:You could do it with apply too, just need to set up the function so it takes a row as input. I'm trying to learn the whole "stop using for loops" thing, can you give me an example of it with df.apply?
|
# ¿ Jun 10, 2018 03:34 |
|
SurgicalOntologist posted:No. In python you should never have to manage your iteration index. Maybe in a while loop or some other place where in some sense you're managing the loop logic yourself. But 99% of the time when I do that I realize there's a better way before I finish coding the loop. Do you have anything I can read (or would you share) about why this is, particularly in the case of a while loop? For example, I wrote something that clicks a button, takes a screenshot, and increments the file name like this (I don't have the code handy). Is this bad? code:
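From memory it was something like this (the coordinates and sleep time are made up): code:
import time
import pyautogui

i = 0
while True:
    pyautogui.click(100, 200)              # click the "next" button
    pyautogui.screenshot(f"shot_{i}.png")  # save with an incrementing file name
    i += 1
    time.sleep(5)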
|
# ¿ Jun 10, 2018 17:12 |
|
baka kaba posted:I think CarForumPoster's example is a whole other thing, that's just an infinite loop that happens to contain an incrementing counter. It's not actually iterating over anything like a collection or a fixed range of values, and I think that's the simplest way to achieve it? Correct. I have to watch a zillion training videos for a new job (and they had tests) and I am too lazy to click next, but I appreciate the other perspectives and solutions. It's definitely a problem I encounter a lot, where I need to do it over a limited number of items in a list or something.
|
# ¿ Jun 10, 2018 17:45 |
|
This was really helpful
|
# ¿ Jun 11, 2018 03:12 |
|
I didnt know that about PyCharm, thanks thread!
|
# ¿ Jun 19, 2018 02:59 |
|
If anyone ever wants a really simple GUI, particularly to make a form for taking arguments and running a script, big shout out to Gooey. Tutorial here: http://pbpython.com/pandas-gui.html I used this and a Python-to-exe maker and made some little tools that helped my coworkers who would be scared of the command line.
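The basic shape, going from memory of that tutorial (the program name and argument are made up): code:
from gooey import Gooey, GooeyParser

@Gooey(program_name="Report Helper")  # made-up tool name
def main():
    parser = GooeyParser(description="Pick a file and run the script")
    parser.add_argument("input_file", widget="FileChooser", help="File to process")
    args = parser.parse_args()
    print(f"Would process {args.input_file} here")

if __name__ == "__main__":
    main()
Gooey turns the argparse-style arguments into form fields, so non-command-line people just click through a window.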
|
# ¿ Jul 10, 2018 00:52 |
|
I would like to make a pandas dataframe out of a JSON file containing scraped court data. I'd like all the data to appear in columns but there's lots of nesting and I can't figure out how to get it in a nice big table like I'd like. I thought the way to go would be something like: code:
csvkit can give me a CSV output which is easier, but it requires a ton of reorganizing and cleanup. I figured a custom solution might be in order. Data sample: (Very long so I pastebinned it) https://pastebin.com/PhDkrCME
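In case it helps frame the question, the rough shape of what I was trying (the sample data here is made up in the spirit of the pastebin; newer pandas exposes json_normalize at the top level, older versions have it under pandas.io.json): code:
import pandas as pd

# tiny made-up sample standing in for the scraped court data
data = [
    {"case": {"number": "2018-CF-001", "judge": {"name": "Smith"}},
     "parties": [{"role": "Defendant", "name": "Doe"}]},
]

# json_normalize flattens nested dicts into dotted column names;
# nested lists (like "parties") stay as list columns unless you pass record_path
df = pd.json_normalize(data)
print(df.columns.tolist())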
|
# ¿ Jul 14, 2018 16:15 |
|
cinci zoo sniper posted:You won't be able to trivially map a decently hierarchical JSON document to a table or set of tables. I had to solve a similar problem at work and my final (naturally a temporary quick-and-dirty one to be rewritten better at the time) solution was to write a MongoDB script that does some server-side filtering and removes unnecessary fields and everything, plus a long-ish Python script with functions to read what I need and write it into pandas. On the Python side I did iterate over individual JSON documents. The process was good enough to comb through a couple gigabytes of JSON poo poo reasonably quickly. I don't mind some time investment in solving this issue. I will need to do it weekly or daily and the data will be formatted like this always. What's the best way to solve it, and have you got a link to learn more about it? I have no experience with MongoDB and don't control the server sending the data, but am willing to learn. CarForumPoster fucked around with this message at 16:58 on Jul 14, 2018 |
# ¿ Jul 14, 2018 16:55 |
|
cinci zoo sniper posted:What's your desired output, to roughly sketch on that sample you linked? I've not seen that many good links on this to be honest, just random blog posts ways back. It's a technologically simple problem that can be labour-"intensive" if your source data and your desired output don't cooperate (like it often is for me since if there's one thing that people's credit histories are not then it's concise). I'll take a look at your file a bit later, on the go right now. My goal is to use the data in a variety of machine learning, NLP and statistics tasks. csvkit (the 70% option) may be a better option than I first thought. The goal would be to map the fields to a set of names that I can play with. E.g. right now I take judge_name, one-hot encode it, and look for correlations to outcomes. code:
CarForumPoster fucked around with this message at 17:20 on Jul 14, 2018 |
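For context, the one-hot encoding step is just pandas' get_dummies; a minimal sketch with a made-up stand-in dataframe: code:
import pandas as pd

df = pd.DataFrame({"judge_name": ["Smith", "Jones", "Smith"]})  # made-up sample

dummies = pd.get_dummies(df["judge_name"], prefix="judge")
df = pd.concat([df, dummies], axis=1)  # one 0/1 column per judge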
# ¿ Jul 14, 2018 17:06 |
|
I love you, thread, and the helpful people within.
|
# ¿ Jul 18, 2018 01:12 |
|
edmund745 posted:Also while searching Google for "python try except", I got the Google code challenge! ...Or does everybody get these now? I have googled exactly that previously and didn't get it.
|
# ¿ Aug 27, 2018 21:37 |
|
edmund745 posted:Maybe it's a lifetime average sort of thing? Most of what I do search for is digital electronics (part datasheets) and programming stuff. I've known multiple Google engineers and I doubt this would exclude you.
|
# ¿ Aug 28, 2018 01:28 |
|
edmund745 posted:Yes but I'm kinda white-trashy so I figured I'd be in the slummy part of Palo Alto. So maybe that's a 1-mil neighborhood? You joke, but there is a slummy part of Palo Alto called East Palo Alto. It's separated by 101 and it's unmetaphorically the other side of the tracks.
|
# ¿ Aug 28, 2018 21:53 |
|
Thermopyle posted:VS Code is a great editor, but it's not even the same league as PyCharm or IntelliJ as an IDE. My company makes us use VS Express Desktop 2015 and it is completely elbows and dildoes compared to the free PyCharm I use on personal projects.
|
# ¿ Sep 6, 2018 22:08 |
|
CarForumPoster posted:completely elbows and dildoes PBS posted:I have to know what this means. Uncomfortably bad.
|
# ¿ Sep 9, 2018 01:04 |
|
cinci zoo sniper posted:https://github.com/mahmoud/boltons Some cool stuff. This, particularly remap, looks really useful.
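E.g., going off the docs example (from memory, so double-check the signature), remap can prune junk out of nested structures in one shot: code:
from boltons.iterutils import remap

data = {"name": "case 1", "judge": None, "parties": ["A", None, "B"]}

# visit returns False to drop an item; this drops None values at any depth
cleaned = remap(data, visit=lambda path, key, value: value is not None)
print(cleaned)  # {'name': 'case 1', 'parties': ['A', 'B']}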
|
# ¿ Oct 17, 2018 01:41 |
|
Jose Cuervo posted:Not sure if this is the right place to ask this, but I want to generate a time series which simulates consumption of a product. One way of doing this is to assume that the consumption has a particular distribution, say Triangular(lower=2, mode=4, upper=5), and then at each time step draw a random variate from that distribution to simulate the amount consumed during that time step. However, generating the time series in this way does not produce any correlation in consumption between successive time steps. That is, if the consumption of the product was on the higher end of the distribution at time t, then the consumption of the product at time t+1 should likely be on the higher end of the distribution as well, and vice versa. If you're generating draws from that triangular distribution, you could add an if/then that looks at the draw from t-1 and keeps regenerating until the new draw is within some distance/percentage x of the previous one.
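A rough sketch of what I mean, with made-up parameters (note the rejection step will distort the marginal distribution somewhat, so treat it as a starting point): code:
import numpy as np

rng = np.random.default_rng()

def correlated_triangular(n, lower=2, mode=4, upper=5, max_step=0.5):
    """Triangular draws where each value stays within max_step of the last."""
    series = [rng.triangular(lower, mode, upper)]
    for _ in range(n - 1):
        draw = rng.triangular(lower, mode, upper)
        # redraw until the new value is close to the previous one
        while abs(draw - series[-1]) > max_step:
            draw = rng.triangular(lower, mode, upper)
        series.append(draw)
    return np.array(series)

print(correlated_triangular(10))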
|
# ¿ Oct 27, 2018 04:14 |
|
I have a dumb question: at some point in developing stuff, does it become more natural to read and work with JSON? Like, are there some of you who can deal with JSON as intuitively as you would data in a table/dataframe?
|
# ¿ Nov 22, 2018 09:57 |
|
I have an easy one that's frustrating me. I want to replace a bunch of strings inside df['col1'][i] with values from df['col2'][i]. I have a column in a dataframe df_maps called name with strings in it: Bob, Dan, Sean. Another called param1: A, B, C. I have a column called 'html' with some html in it and placeholders to replace ($name and $param1): <h2>$name</h2><span style="color: #008000;"><strong>$param1</strong> This executes: code:
<h2>$name</h2><span style="color: #008000;"><strong>$param1</strong> instead of <h2>Dan</h2><span style="color: #008000;"><strong>$param1</strong>
|
# ¿ Nov 27, 2018 21:02 |
|
SurgicalOntologist posted:You are re-creating the column 'combined' at every iteration of the loop, essentially resetting it. Try this: I feel like an idiot and that worked.
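For posterity, the working row-wise version is roughly this (sample data made up to match the columns above): code:
import pandas as pd

df_maps = pd.DataFrame({
    "name": ["Bob", "Dan", "Sean"],
    "param1": ["A", "B", "C"],
    "html": ['<h2>$name</h2><strong>$param1</strong>'] * 3,
})

def fill_template(row):
    # substitute this row's values into this row's html template
    return row["html"].replace("$name", row["name"]).replace("$param1", row["param1"])

df_maps["combined"] = df_maps.apply(fill_template, axis=1)
print(df_maps["combined"].tolist())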
|
# ¿ Nov 27, 2018 21:21 |
|
What I want to do: Split a dataframe column of name strings into first and last names using nameparser. Have the first and last names in their own columns. Problem: I get an object with the original name which contains all of the name parts in a "name" object rather than a series that is appended as new columns. Any suggestions on how to properly solve this problem? code:
code:
EDIT: This worked. code:
CarForumPoster fucked around with this message at 13:24 on Dec 4, 2018 |
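Roughly what the working version looks like (sample names made up; the trick is returning a Series so apply expands it into columns): code:
import pandas as pd
from nameparser import HumanName

df = pd.DataFrame({"Name": ["Bob A. Smith", "Dan Jones"]})  # made-up sample

def split_name(name):
    parsed = HumanName(name)
    # returning a Series makes apply expand into separate columns
    return pd.Series({"first_name": parsed.first, "last_name": parsed.last})

df[["first_name", "last_name"]] = df["Name"].apply(split_name)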
# ¿ Dec 4, 2018 12:49 |
|
I got another easy/dumb one. Have df["Time"] with code:
code:
      4 for i in range(len(temp)):
----> 5     if temp[i][1] == "days":
      6         temp[i][0] = int(temp[i][0])
      7     elif pd.isnull(temp[i]):

TypeError: 'float' object is not subscriptable

But I think the real issue is the "isnull" part. I think it's telling me hey, there is no "ith" element to look up... but there is, right?
|
# ¿ Dec 7, 2018 16:26 |
|
cinci zoo sniper posted:It tries to get 2nd element inside NULL, that temp[i] corresponds to. Also I would suggest to use dateparser for this. Good call. This works and should be more robust than my previous one: code:
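Something along these lines (the sample values are made up; dateparser handles the relative "X days ago"-style strings): code:
import pandas as pd
import dateparser

df = pd.DataFrame({"Time": ["2 days ago", "3 hours ago", None]})  # made-up sample

def to_datetime(value):
    # guard against the NaN rows that caused 'float' object is not subscriptable
    if pd.isnull(value):
        return pd.NaT
    return dateparser.parse(value)

df["Parsed"] = df["Time"].apply(to_datetime)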
|
# ¿ Dec 7, 2018 18:08 |
|
TLDR: I get ValueError: ('cannot index with vector containing NA / NaN values', 'occurred at index 0') but I don't think I have any NA values. I have code that wants to look up a string from df["StatCat"] in ArrestTable and return whether it's a misdemeanor, felony, or other in the column "Lev". It will eventually do this for several other columns as well. Because the records aren't always exact in writing the statute number that's in StatCat, it needs to be a fuzzy match. code:
code:
code:
code:
code:
CarForumPoster fucked around with this message at 15:34 on Dec 12, 2018 |
# ¿ Dec 12, 2018 14:42 |
|
SurgicalOntologist posted:Instead of df_raw["StatCat"][row], try row["StatCat"] inside the function you're applying. The point of apply is that the looping is done internally so you shouldn't have to index back into the dataframe. The row that is passed to apply is not the row index, it's the actual row data as a series. cinci zoo sniper posted:Yeah I was wondering if there is some indexing magic going on that can actually work like in that example or not. This was the issue. Thank you both. The new issue is that the fuzzy logic function is hilariously slow. I need to run it on 1M rows. It takes 2m 30s to run on 1000 rows. I'm going to try to figure out something with .map and str.contains or some other solution that ends up being "good enough"
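For anyone hitting the same thing, the corrected shape is roughly this (the lookup-table column names and values are made-up stand-ins): code:
import pandas as pd
from fuzzywuzzy import process

# tiny made-up stand-ins for the real tables
ArrestTable = pd.DataFrame({"Statute": ["456.023&1a", "111.111"],
                            "Lev": ["Misdemeanor", "Felony"]})
df_raw = pd.DataFrame({"StatCat": ["456.023(1a)", "111.111"]})

lev_lookup = dict(zip(ArrestTable["Statute"], ArrestTable["Lev"]))
choices = list(lev_lookup.keys())

def lookup_level(row):
    # fuzzy-match this row's statute against the lookup table's keys
    match, score = process.extractOne(row["StatCat"], choices)
    return lev_lookup[match] if score >= 85 else None

df_raw["Lev"] = df_raw.apply(lookup_level, axis=1)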
|
# ¿ Dec 13, 2018 22:16 |
|
SurgicalOntologist posted:Do you have python-levenshtein installed? IIRC it's not a requirement of fuzzywuzzy but if it's available it will run faster. Other than that I would just bite the bullet and run it over the weekend, assuming you only need to run it once. Yes, still slow as balls, and I'm doing EDA with a large, new data set so I tend to run it frequently. I made some improvements to the lookup table and am now down to 20% NaNs instead of 40%, and it runs lightning fast, but still misses anything that's not an exact match. code:
the lookup table has 456.023&1a and 456.023&1b, but I am fine with returning the first one it matches, as it's accurate enough, so fuzzywuzzy was returning the result after matching on 456.023&1a CarForumPoster fucked around with this message at 22:54 on Dec 13, 2018
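The "good enough" plan is basically exact matches via .map first, then fuzzy only for the leftovers; a sketch assuming the same made-up stand-in tables as above: code:
import pandas as pd
from fuzzywuzzy import process

# stand-ins; "Statute" is an assumed column name
ArrestTable = pd.DataFrame({"Statute": ["456.023&1a", "111.111"],
                            "Lev": ["Misdemeanor", "Felony"]})
df_raw = pd.DataFrame({"StatCat": ["456.023&1a", "111.111(2)"]})

lev_map = dict(zip(ArrestTable["Statute"], ArrestTable["Lev"]))
df_raw["Lev"] = df_raw["StatCat"].map(lev_map)   # fast exact matches

choices = list(lev_map.keys())
missing = df_raw["Lev"].isnull()
# only the rows that missed get the slow fuzzy treatment
df_raw.loc[missing, "Lev"] = df_raw.loc[missing, "StatCat"].apply(
    lambda s: lev_map[process.extractOne(s, choices)[0]]
)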
# ¿ Dec 13, 2018 22:49 |
|
How do I do something to a range of columns in a df? I have 10 columns named "DispoClass_#" with the # being 1-10. I want to set the ordinality of the categorical values they contain with .cat.set_categories. How do I select all 10? I need to do this with other things structured as "Name_#", so just writing them out isn't that desirable. Something like this, but, ya know, works... code:
|
# ¿ Dec 17, 2018 16:16 |
|
cinci zoo sniper posted:You should use df.filter() for that. Much appreciated. Also appreciate you helping me last week. Also I just now found out about: https://regex101.com/ Holy crap is that helpful! I have the worst time with regexes and I basically end up finding someone on Stack Overflow who wanted the same thing and copying the answer.
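For the archives, df.filter with a regex plus a loop over the matching columns is roughly what this looks like (the stand-in dataframe and the category order are made up): code:
import pandas as pd

# made-up stand-in with two of the ten DispoClass_# columns
df = pd.DataFrame({
    "DispoClass_1": pd.Categorical(["Plea", "Dismissed"]),
    "DispoClass_2": pd.Categorical(["Guilty", "Plea"]),
})

# grab every DispoClass_1 ... DispoClass_10 column by regex
dispo_cols = df.filter(regex=r"^DispoClass_\d+$").columns

order = ["Dismissed", "Plea", "Guilty"]  # placeholder ordering
for col in dispo_cols:
    df[col] = df[col].cat.set_categories(order, ordered=True)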
|
# ¿ Dec 17, 2018 19:25 |
|
priznat posted:Anyone know of a good project that can webscrape for pricing information? I was thinking beautifulsoup mostly. Just have a few pages from vendors I setup watches on (like camelcamelcamel but cross vendor) ParseHub is a non-programming solution (it's basically Firefox's element picker for scraping, best I can tell) that I found pretty easy to use after a few minutes of their tutorial videos. Has some limitations, but if you want very quick and dirty, I like it and it's free for up to 200 pages at once. You can load search terms or whatever straight from a CSV or JSON and loop over them, which is nice.
|
# ¿ Jan 18, 2019 14:24 |