|
What have I got myself into...
|
# ? Nov 11, 2017 00:07 |
|
Plinkey posted:What have I got myself into... not being especially fluent in interwebbing I have no idea.
|
# ? Nov 11, 2017 00:25 |
|
Plinkey posted:What have I got myself into... Building a searchable database that will be more modern, quick, and easier to read? EDIT: RIP your soul... please make a public interface Da Mott Man fucked around with this message at 02:00 on Nov 11, 2017 |
# ? Nov 11, 2017 01:57 |
|
CNN’s Jake Tapper Asks Republican: Would You Vote For A ‘Child Molester’? As if we don't know what Freep's answer would be... quote:To: markomalley quote:To: markomalley But Chappaquiddick! quote:To: markomalley Man, there are so many layers to this one, it's incredible. quote:To: markomalley quote:To: markomalley [citation needed] quote:To: markomalley Civil War is Coming! quote:To: ClearCase_guy "So I'll put your response down as 'No,' then." quote:To: markomalley Have I said 'Chappaquiddick' yet? quote:To: markomalley This is good for Moore! quote:To: markomalley quote:To: ClearCase_guy Did somebody say Pizza... Gate? quote:To: markomalley I dunno, the GOP's response so far has been pretty offensive to me. quote:To: Sarah Barracuda quote:To: vette6387
|
# ? Nov 11, 2017 03:06 |
|
Da Mott Man posted:Can't wait to see what the scraper can find. Huh. I did not know that was a thing. I wrote my own. I have 1.6 million freeper comments in JSON format. I am planning to do... things with them.
|
# ? Nov 11, 2017 04:45 |
|
Jesus, the fixation on Kennedy. Dude died nearly a decade ago, and was out of office before I was even old enough to vote. They could not possibly try and distract themselves from the fact that they're supporting a pedophile more than they are now.
|
# ? Nov 11, 2017 05:00 |
|
Veni Vidi Ameche! posted:Huh. I did not know that was a thing. I wrote my own. I have 1.6 million freeper comments in JSON format. I am planning to do... things with them. Please do not create a chatbot Tay was risky enough
|
# ? Nov 11, 2017 05:15 |
|
GreyjoyBastard posted:Please do not create a chatbot On the other hand, create a chatbot and put it on twitter.
|
# ? Nov 11, 2017 05:30 |
|
Admiral Ray posted:On the other hand, create a chatbot and put it on twitter. I wonder what the record is for quickest twitter ban.
|
# ? Nov 11, 2017 05:32 |
|
Plinkey posted:I wonder what the record is for quickest twitter ban. nah it's a freeperbot it'll get verified
|
# ? Nov 11, 2017 05:32 |
|
Admiral Ray posted:On the other hand, create a chatbot and put it on twitter. I can neither confirm nor deny that this is what I plan to do with the two RNNs that I am currently training on separate EC2 GPU instances. Edit: On a completely unrelated note, I am taking suggestions for best Twitter libraries in Perl, PHP, Node, C++, C#, and Python. Veni Vidi Ameche! fucked around with this message at 05:44 on Nov 11, 2017 |
# ? Nov 11, 2017 05:41 |
|
Veni Vidi Ameche! posted:I can neither confirm nor deny that this is what I plan to do with the two RNNs that I am currently training on separate EC2 GPU instances. Did you take into account getting rid of quotes based on how freepers quote by copy/pasting?
|
# ? Nov 11, 2017 05:44 |
|
Plinkey posted:Did you take into account getting rid of quotes based on how freepers quote by copy/pasting? I preserve all html and formatting so the comments can be filtered however I please. I considered getting rid of quotes. After a bit of thought, I decided that the more often something was quoted, the more it reflected what Freep is really about, anyway. No point in trimming them out, if I'm right about that.
|
# ? Nov 11, 2017 05:47 |
|
Veni Vidi Ameche! posted:I preserve all html and formatting so the comments can be filtered however I please. I considered getting rid of quotes. After a bit of thought, I decided that the more often something was quoted, the more it reflected what Freep is really about, anyway. No point in trimming them out, if I'm right about that. Ah, ok I was stripping only relevant info from posts and threads from the html in memory and throwing them into elasticsearch to do something with them later. Did you have any issues with waiting between scrapes so that it doesn't look like your ip is scraping?
|
# ? Nov 11, 2017 05:49 |
|
Plinkey posted:Ah, ok I was stripping only relevant info from posts and threads from the html in memory and throwing them into elasticsearch to do something with them later. Did you have any issues with waiting between scrapes so that it doesn't look like your ip is scraping? I've been running HTTrack on them 24/7 for fourteen months from a single IP address. They're not real on top of their game. I just wrote code to process the data. I started dumping it into DynamoDB, but I decided I don't like NoSQL for this. code:
Edit II: I did configure HTTrack so it isn't obnoxiously hammering their servers, but I didn't make any attempts to disguise my traffic. Veni Vidi Ameche! fucked around with this message at 05:56 on Nov 11, 2017 |
# ? Nov 11, 2017 05:53 |
|
Ah, nice. I'll take out all of my waits then. Freep replies seem to lend themselves to ES pretty well: treat each one like a document and every field is easily searchable. I'm definitely an ES novice, so I figured this would be a good way to get into it.
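For anyone following along, here is a rough sketch of the "one reply, one document" idea. It only builds the newline-delimited JSON body that Elasticsearch's `_bulk` endpoint expects and does not talk to a server. The index name `freep-replies` and the field names (`thread_id`, `post_number`, `author`, `to`, `body`) are made up for illustration, not anything from an actual scraper.

```python
import json


def to_bulk_payload(replies, index_name="freep-replies"):
    """Build an NDJSON body for Elasticsearch's _bulk endpoint.

    Each reply becomes two lines: an action line and the document itself.
    The index name and field names are hypothetical.
    """
    lines = []
    for reply in replies:
        # A stable id means re-running the scraper overwrites instead of duplicating.
        doc_id = f"{reply['thread_id']}-{reply['post_number']}"
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(reply))
    if not lines:
        return ""
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


sample = [
    {"thread_id": 3603775, "post_number": 1, "author": "example_user",
     "to": "example_op", "body": "first reply"},
    {"thread_id": 3603775, "post_number": 2, "author": "other_user",
     "to": "example_user", "body": "second reply"},
]
payload = to_bulk_payload(sample)
```

The resulting string would be POSTed to `/_bulk` with the `application/x-ndjson` content type. Whether you keep the raw HTML in a field or only the stripped text is a separate choice.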
|
# ? Nov 11, 2017 06:01 |
|
Plinkey posted:Ah, nice. I'll take out all of my waits then. Yeah, don't sweat too much about timeouts, but keep your limits sane. We can't be the only two people on the planet scraping Freep, and I don't want to cripple their servers and/or get noticed. I don't remember my exact settings, but I'm using one thread, and generous timeouts and delays. If you don't have it, yet, get the Sense plugin for your browser. It's the best thing you can do for yourself as far as exploring ElasticSearch. It's so good, Elastic absorbed it, and it's now an official product of theirs. I'm not sure what they've done with it, but you can still find "Sense (Beta)" for Chrome.
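If you are rolling your own scraper instead of using HTTrack, the "one thread, generous timeouts and delays" setup might look something like this. It is a minimal sketch, not my actual settings: the 5 second delay and 30 second timeout are placeholder numbers, and `crawl_politely` is a made-up helper name.

```python
import time
import urllib.request


def fetch(url, timeout=30):
    """Download one page with a generous timeout (value is a placeholder)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def crawl_politely(urls, fetch_page=fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch URLs one at a time on a single thread, pausing between requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            # Wait between requests so the server is never hammered.
            sleep(delay_seconds)
        try:
            pages[url] = fetch_page(url)
        except OSError as err:
            # urllib's URLError and socket timeouts are both OSError subclasses.
            pages[url] = None
            print(f"skipped {url}: {err}")
    return pages
```

`fetch_page` and `sleep` are parameters so the loop can be tried out with fakes before pointing it at a real site.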
|
# ? Nov 11, 2017 06:06 |
|
Looks like it's this now: https://www.elastic.co/guide/en/kibana/current/console-kibana.html I've been using something similar in cerebro.
|
# ? Nov 11, 2017 06:16 |
|
Plinkey posted:Looks like it's this now: https://www.elastic.co/guide/en/kibana/current/console-kibana.html That looks like it. It is (was?) a great tool. It makes querying your ElasticSearch server a breeze. Edit: quote:No matter what the people will kill anyway! Veni Vidi Ameche! fucked around with this message at 07:06 on Nov 11, 2017 |
# ? Nov 11, 2017 06:51 |
|
You don't have PMs or I'd message you, but did freep start their new number scheme at 1,000,000? I can't figure out how to go back farther than that.
|
# ? Nov 11, 2017 07:51 |
|
Plinkey posted:You don't have PMs or I'd message you, but did freep start their new number scheme at 1,000,000? I can't figure out how to go back farther than that. Yeah. I didn’t shell out for any upgrades. I am not scraping sequentially. I am spidering the entire site from the root, but I am starting to focus on /focus/*.
|
# ? Nov 11, 2017 08:39 |
|
Veni Vidi Ameche! posted:Yeah. I didn’t shell out for any upgrades. Ah, I am, so was looking for the base threadid kinda thing. I just kicked off a scraper starting at 1 mil so we'll see what it picks up, hopefully my 60 gig VM doesn't fill up overnight.
|
# ? Nov 11, 2017 08:46 |
|
ArchRanger posted:Jesus, the fixation on Kennedy. Dude died nearly a decade ago, and was out of office before I was even old enough to vote. They could not possibly try and distract themselves from the fact that they're supporting a pedophile more than they are now. That's their go-to defense, like watching them hang Bill Clinton again despite him having nothing to do with politics, and still trying to lynch Hillary and Obama.
|
# ? Nov 11, 2017 08:59 |
|
Veni Vidi Ameche! posted:I preserve all html and formatting so the comments can be filtered however I please. I considered getting rid of quotes. After a bit of thought, I decided that the more often something was quoted, the more it reflected what Freep is really about, anyway. No point in trimming them out, if I'm right about that. That and they are incredibly horribly bad at having any sort of quoting consistency whatsoever. Each person does it differently, some add quotes, some add fancy quotes, some italicize, some use different characters as delimiters, and a handful get adventurous and add blockquotes. Also holy poo poo I haven't heard the name HTTrack in well over a decade.
|
# ? Nov 11, 2017 09:17 |
|
McGlockenshire posted:That and they are incredibly horribly bad at having any sort of quoting consistency whatsoever. Each person does it differently, some add quotes, some add fancy quotes, some italicize, some use different characters as delimiters, and a handful get adventurous and add blockquotes. I could probably build up a small library of regexes that would get most of it, but it doesn’t seem important. One thing I would like to do is filter out all the “Thanks for your donation!” posts. There are more than you might think, and I think he hand-writes them all, because they don’t seem to follow a consistent format. Believe it or not, HTTrack is still actively maintained. They had a release about six months ago, I think. It’s not the prettiest tool, but it’s mature, and it has served me well for many years. I could go with something like Selenium, but it’s a lot more code and would probably make little or no difference on this project. I keep all the raw HTML, so I can re-process everything from the ground up any time I want. If I decide to filter the donation posts or whatever, I just add some code to index.js and let ‘er rip.
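To give an idea of what a "small library of regexes" could look like, here is a rough Python sketch (the real processing lives in index.js, so treat this as an illustration only). The patterns are guesses at the quoting styles described above, and the donation pattern is an assumption about how those posts are worded. It is deliberately blunt: it drops every italic span, and the straight-quote pattern would also eat HTML attribute values, so it would need tightening before touching real data.

```python
import re

# Hypothetical patterns, one per quoting habit described above.
QUOTE_PATTERNS = [
    re.compile(r"<blockquote\b[^>]*>.*?</blockquote>", re.IGNORECASE | re.DOTALL),
    re.compile(r"<(i|em)\b[^>]*>.*?</\1>", re.IGNORECASE | re.DOTALL),  # italicized quotes
    re.compile(r"\u201c[^\u201d]*\u201d"),  # fancy (curly) quotes
    re.compile(r'"[^"]*"'),                 # plain straight quotes
]

# Loose match for hand-written thank-you notes; the wording is an assumption.
DONATION_PATTERN = re.compile(r"\bthanks?\b.{0,40}\bdonation\b", re.IGNORECASE | re.DOTALL)


def strip_quotes(html):
    """Remove anything that looks like quoted text, then tidy whitespace."""
    for pattern in QUOTE_PATTERNS:
        html = pattern.sub("", html)
    return re.sub(r"\s+", " ", html).strip()


def is_donation_post(html):
    """Return True if the post looks like a donation thank-you."""
    return DONATION_PATTERN.search(html) is not None
```

Because the raw HTML is kept, filters like these can be changed and the whole archive re-processed without scraping anything again.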
|
# ? Nov 11, 2017 09:32 |
|
McGlockenshire posted:That and they are incredibly horribly bad at having any sort of quoting consistency whatsoever. Each person does it differently, some add quotes, some add fancy quotes, some italicize, some use different characters as delimiters, and a handful get adventurous and add blockquotes. One broke my scraper today: he literally posted the entire HTML of another thread, so I picked up post number 60 when I was expecting 20.
|
# ? Nov 11, 2017 09:36 |
|
Veni Vidi Ameche! posted:I keep all the raw HTML, so I can re-process everything from the ground up any time I want. To make up for the tech talk, here's Report: Alabama Woman Claims Reporter Offered Her $1000s to Accuse Roy Moore of Sexual Abuse?... posted by freeper blueyon Some rando on twitter said that someone told him that one of the women interviewed by the WaPo was paid. The twitter account in question looks and talks like a freeper. Anyway, Gateway Pundit decided randos have credibility and they ran with it. quote:To: blueyon "I choose to believe this because it is politically convenient and it fits my worldview." quote:To: Katya quote:To: blueyon freeper opinions: doxx the victims quote:To: Katya In case you were left with any doubt after the Jake Tapper thread, this is the freeper mindset right now: quote:To: blueyon I know that they didn't miss Roy Moore admitting to dating the other girls, so why do they hold this opinion? (Duh.) quote:To: blueyon quote:To: Chauncey Gardiner Yes they will take some rando on twitter as a real and serious source and believe them over a well-researched, well-sourced article in a reliable news outlet. quote:To: blueyon
|
# ? Nov 11, 2017 09:39 |
|
McGlockenshire posted:How much disk space does your archive take up? I've been meaning to write a thing that tried to track historical zots and other times when accounts go dead and I was trying to work with Plinkey to write a mutually beneficial scraper (though my attention has been elsewhere). I have 256 gigs allocated to it, and I think that’s around 82% full. I’m not near that machine, right now, so I can’t check. A lot of that space is temp files from HTTrack.
|
# ? Nov 11, 2017 10:32 |
|
Veni Vidi Ameche! posted:I have 256 gigs allocated to it, and I think that’s around 82% full. I’m not near that machine, right now, so I can’t check. A lot of that space is temp files from HTTrack. Well at least now we know where the original hate bot that wipes out humanity came from
|
# ? Nov 11, 2017 14:06 |
|
Why would you want to generate more freep
|
# ? Nov 11, 2017 14:14 |
|
Veni Vidi Ameche! posted:I can neither confirm nor deny that this is what I plan to do with the two RNNs that I am currently training on separate EC2 GPU instances. For python I like http://tweepy.readthedocs.io/en/v3.5.0/
|
# ? Nov 11, 2017 15:18 |
|
Does this really need a context? quote:To: governsleastgovernsbest Article title: Liberal Pastor Accuses Conservative Christians of “Policy Pedophilia” link: http://www.freerepublic.com/focus/f-news/3603775/posts#comment
|
# ? Nov 11, 2017 18:54 |
|
Interesting, so there are comments from this year on 14 year old threads in freep e: Wonder if it was roy moore's account
|
# ? Nov 11, 2017 20:09 |
|
Just hit 1mil, also I plan to eventually make this public for everyone to query for freep terribleness, and maybe a function to format posts to make it easier to post here.
|
# ? Nov 12, 2017 00:13 |
|
Icon Of Sin posted:Does this really need a context? I accidentally tapped someone's username while browsing that thread and found this: quote:
Freep has a board of directors?
|
# ? Nov 12, 2017 02:00 |
|
Why Would Kim Jong-Un Insult Me By Calling Me "old," When I would NEVER Call Him "Short And Fat?" posted by freeper Enlightened1 https://twitter.com/realDonaldTrump/status/929511061954297857 quote:Hilarious! quote:To: Enlightened1 quote:To: Enlightened1 quote:To: blu quote:To: blu quote:To: Enlightened1
|
# ? Nov 12, 2017 02:47 |
|
Nukes are going to be fired over schoolyard name calling.
|
# ? Nov 12, 2017 04:01 |
|
That's just a shittier version of a solid burn Reagan got on one of his opponents. I'm shocked freep has low standards for comedy.
|
# ? Nov 12, 2017 04:19 |
|
GreyjoyBastard posted:Please do not create a chatbot The libertarian thread ended up creating Jrodbot to mimic its star poster, and a quote from that bot is still the thread subtitle.
|
# ? Nov 12, 2017 04:27 |
|
RagnarokAngel posted:That's just a shittier version of a solid burn Reagan got on one of his opponents. Im shocked freep has low standards for comedy Which one? Is it as good as "Senator, you're no Jack Kennedy"?
|
# ? Nov 12, 2017 05:01 |