Volguus
Mar 3, 2009

baka kaba posted:

Regular expressions seem like a good match for that? I'm guessing you have a bunch of inconsistent filenames, so you want a general rule for pulling out the bits of info you want, which are probably something like

  • The series title is at the start of the filename (maybe ignoring any leading symbols)
  • The year is 4 digits in parentheses, following the title
  • Series and episode are Sxx and Exx where those X's are digits
  • Everything else is trash?

You could write a regex that pulls those out, and do any processing you need on those bits (like fixing title spaces). You could use normal searching logic in your language, but regex can express a lot fairly simply. Depends how much variation you expect to see in your title formatting, and how much you want to handle instead of saying "can't do anything with this one"

regex101 is good to play around and test things

The problem with regex for this task is that he'll need a ton of them, and maintaining them (adding/removing/updating) will be a full-time job, since two downloads will rarely share a pattern. This is an open problem as far as I know, and it's what newznab deals with. The only way to solve this is (in my opinion) via machine learning. How? No idea, yet.


luchadornado
Oct 7, 2004

A boombox is not a toy!

You could also look into parser combinators. The best bang for the buck would likely be a handful of loose regexes that get you 98% of the way there, and then manual correction on the remaining 2%.

baka kaba
Jul 19, 2003

PLEASE ASK ME, THE SELF-PROFESSED NO #1 PAUL CATTERMOLE FAN IN THE SOMETHING AWFUL S-CLUB 7 MEGATHREAD, TO NAME A SINGLE SONG BY HIS EXCELLENT NU-METAL SIDE PROJECT, SKUA, AND IF I CAN'T PLEASE TELL ME TO
EAT SHIT

Volguus posted:

The problem with regex for this task is that he'll need a ton of them, and maintaining them (adding/removing/updating) will be a full-time job, since two downloads will rarely share a pattern. This is an open problem as far as I know, and it's what newznab deals with. The only way to solve this is (in my opinion) via machine learning. How? No idea, yet.

Well, it depends how robust it needs to be. Like you say, it's an open problem, and there are already solutions that have had a ton of time invested in developing and updating them - I'm guessing the OP just wants to create a simple tool that mostly gets the job done, and asks you to fix or specify things manually if it can't handle a particular string

like this
^(.+)\((\d{4})\).*S(\d{2}).*E(\d{2})
is a basic one that handles the example format with some leeway, makes those assumptions I said, and pulls out the relevant bits. Obviously it's not perfect - the title needs fixing, it doesn't handle combined episodes like E01-02, the year might not be in parentheses, etc. But it does a lot, it just depends how much variation you expect to see and where you draw the line for supporting them. And even doing it in code without regexes, you still have to write all this logic anyway. It's the scope of the problem that's the issue really
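As a minimal sketch of that regex in use (the filename here is made up):

```python
import re

# Title, then a 4-digit year in parentheses, then Sxx/Exx markers
pattern = re.compile(r"^(.+)\((\d{4})\).*S(\d{2}).*E(\d{2})")

m = pattern.match("The.Great.Show.(2004).S01E02.720p.mkv")
if m:
    title, year, season, episode = m.groups()
    # Fix up the title: drop trailing dots, turn separators into spaces
    title = title.strip(". ").replace(".", " ")
    print(title, year, season, episode)  # The Great Show 2004 01 02
else:
    print("can't do anything with this one")
```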

I guess the important thing is to say, don't try and write one regex that does everything. Write one that breaks things up into parts, have a few fallback ones if you need to handle awkwardly different formats, then you can handle those parts with their own parsing logic

luchadornado
Oct 7, 2004

A boombox is not a toy!

baka kaba posted:

Write one that breaks things up into parts, have a few fallback ones if you need to handle awkwardly different formats, then you can handle those parts with their own parsing logic

That's a parser combinator. Plenty of libraries for them, like: https://pythonhosted.org/parsec/
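The core idea is small enough to hand-roll for a sketch — this is just the concept, not the parsec library's actual API:

```python
import re

# Toy parser combinators: a parser is a function taking (text, pos) and
# returning (value, new_pos) on success, or None on failure.

def rx(pattern):
    """Lift a regex into a parser."""
    creg = re.compile(pattern)
    def parse(text, pos):
        m = creg.match(text, pos)
        return (m.group(), m.end()) if m else None
    return parse

def seq(*parsers):
    """Run parsers one after another; fail if any fails."""
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return (values, pos)
    return parse

def alt(*parsers):
    """Try parsers in order; return the first success."""
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse

# Combine small parsers into a filename parser
title = rx(r"[^(]+")
year = rx(r"\((\d{4})\)")
episode = rx(r".*?S\d{2}E\d{2}")

filename = seq(title, year, episode)
print(filename("The.Great.Show.(2004).S01E02.mkv", 0))
```

The payoff is composability: each piece is testable on its own, and `alt` gives you the fallback-formats behaviour for free.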

Volguus
Mar 3, 2009

Helicity posted:

That's a parser combinator. Plenty of libraries for them, like: https://pythonhosted.org/parsec/

Hmm, looking at it, yes, it looks like a parser combinator may be the golden ticket. Or, at least golden enough. Not as hip as machine learning but oh well....

ultrafilter
Aug 23, 2007

It's okay if you have any questions.


It may not be as hip as machine learning, but it's much less likely to break in very unexpected ways, and much easier to fix if something does go wrong.

RPATDO_LAMD
Mar 22, 2013

🐘🪠🍆
There are a couple of open source "personal media server" projects that already do this. (Somehow extract series from the filename and identify the episode title etc from an api lookup on TVDB or whatever.) You might want to check Kodi or Plex and see what you can crib off of them.

baka kaba
Jul 19, 2003


Helicity posted:

That's a parser combinator. Plenty of libraries for them, like: https://pythonhosted.org/parsec/

That's cool, although... do you have to look at the Haskell docs to find out how to use it?

I do like me some functional programming, but for the original question I was thinking more like "here's the most common formats as regexes, just loop over until one of them matches, then do this stuff with the groups". Pretty basic, learning regex is enough if you've never done it before! You can always get fancy later. Some of these libraries look nice and fluent though, I'll have to try one out sometime
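That loop-over-formats approach might look like this (the patterns and filenames are made up):

```python
import re

# Most common formats first; the first pattern that matches wins
PATTERNS = [
    re.compile(r"^(?P<title>.+)\((?P<year>\d{4})\).*S(?P<season>\d{2}).*E(?P<episode>\d{2})"),
    re.compile(r"^(?P<title>.+)S(?P<season>\d{2})E(?P<episode>\d{2})"),  # no year
]

def parse_filename(name):
    for pattern in PATTERNS:
        m = pattern.match(name)
        if m:
            info = m.groupdict()
            # Do this stuff with the groups: clean up the title
            info["title"] = info["title"].strip(". ").replace(".", " ")
            return info
    return None  # hand this one off for manual fixing

print(parse_filename("Some.Show.(1999).S02E05.mkv"))
print(parse_filename("weird file name"))
```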

goodness
Jan 3, 2012

just keep swimming
I think parsing is the way to go for my use. 99% of the files are going to have the needed info in the title, so it will be a matter of splitting out what I need and recombining.

I found a program called FileBot, but I couldn't make much sense of the GitHub repo, and it's in Java, which I don't know at all. https://github.com/filebot/filebot

My goal is to complete a few projects alongside taking Data Structures. I have an internship lined up at an aerospace company for this summer, but my resume will be pretty blank regarding CS besides that. The other two projects I am working on are a reef/aquarium controller on a Raspberry Pi and a MIDI controller built with an Arduino.

tef
May 30, 2004

-> some l-system crap ->

22 Eargesplitten posted:

Can someone explain in idiot terms what RESTful means?

Here's a long post that tries to explain what the hell is REST, anyway.

Originating in a thesis, REST is an attempt to explain what makes the browser distinct from other networked applications.

You might be able to imagine a few reasons why: there are tabs, and a back button too, but what makes the browser unique is that it can be used to check email without knowing anything about POP3 or IMAP.

Although every piece of software inevitably grows to check email, the browser is unique in the ability to work with lots of different services without configuration—this is what REST is all about.

HTML only has links and forms, but it's enough to build incredibly complex applications. HTTP only has GET and POST, but that's enough to know when to cache things. HTTP uses URLs, so it's easy to route messages to different places too.

Unlike almost every other networked application, the browser is remarkably interoperable. The thesis was an attempt to explain how that came to be, and called the resulting style REST.

REST is about having a way to describe services (HTML), to identify them (URLs), and to talk to them (HTTP), where you can cache, proxy, or reroute messages, and break up large or long requests into smaller interlinked ones too.

How REST does this isn't exactly clear.

The thesis breaks down the design of the web into a number of constraints—Client-Server, Stateless, Caching, Uniformity, Layering, and Code-on-Demand—but it is all too easy to follow them and end up with something that can't be used in a browser.

REST without a browser means little more than "I have no idea what I am doing, but I think it is better than what you are doing.", or worse "We made our API look like a database table, we don't know why". Instead of interoperable tools, we have arguments about PUT or POST, endless debates over how a URL should look, and somehow always end up with a CRUD API and absolutely no browsing.

There are some examples of browsers that don't use HTML, but many of these HTML replacements are for describing collections, and as a result most of the browsers resemble file browsing more than web browsing. It's not to say you need a back and a next button, but it should be possible for one program to work with a variety of services.

For an RPC service, you might think about a `curl`-like tool for sending requests to a service:

code:
$ rpctl http://service/ describe MyService
methods: ...., my_method

$ rpctl http://service/ describe MyService.my_method
arguments: name, age

$ rpctl http://service/ call MyService.my_method --name="james" --age=31
Result:
   message: "Hello, James!"
You can also imagine a single command line tool for databases that might resemble `kubectl`:

code:
$ dbctl http://service/ list ModelName --where-age=23
$ dbctl http://service/ create ModelName --name=Sam --age=23
$ ...
Now imagine using the same command line tool for both, and using the same command line tool for _every_ service—that's the point of REST. Almost.


code:
$ apictl call MyService:my_method --arg=...
$ apictl delete MyModel --where-arg=...
$ apictl tail MyContainers:logs --where ...
$ apictl help MyService
You could implement a command line tool like this without going through the hassle of reading a thesis. You could download a schema in advance, or load it at runtime, and use it to create requests and parse responses.
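A sketch of that schema idea, with an invented schema shape (nothing here is a real wire protocol; `build_request` and the dict layout are made up for illustration):

```python
# Hypothetical: a schema describing a service, loaded in advance or at
# runtime, used to build and validate requests generically.
SCHEMA = {
    "MyService": {
        "my_method": {"arguments": ["name", "age"]},
    },
}

def build_request(schema, service, method, **args):
    spec = schema[service][method]
    unknown = set(args) - set(spec["arguments"])
    if unknown:
        raise ValueError(f"unknown arguments: {unknown}")
    return {"service": service, "method": method, "args": args}

req = build_request(SCHEMA, "MyService", "my_method", name="james", age=31)
print(req)
```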

REST is quite a bit more than being able to reflect, or describe a service at runtime. The REST constraints require using a common format for the contents of messages so that the command line tool doesn't need configuring, and require sending the messages in a way that allows you to proxy, cache, or reroute them without fully understanding their contents.

REST is also a way to break apart long or large messages into smaller ones linked together—something far more than just learning what commands can be sent at runtime, but allowing a response to explain how to fetch the next part in sequence.

To demonstrate, take an RPC service with a long running method call:

code:
class MyService(Service):
    @rpc()
    def long_running_call(self, args: str) -> bool:
        id = third_party.start_process(args)
        while third_party.wait(id):
            pass
        return third_party.is_success(id)
When a response is too big, you have to break it down into smaller responses. When a method is slow, you have to break it down into one method to start the process, and another method to check if it's finished.

code:
class MyService(Service):
    @rpc()
    def start_long_running_call(self, args: str) -> str:
         ...
    @rpc()
    def wait_for_long_running_call(self, key: str) -> bool:
         ...

In some frameworks you can use a streaming API instead, but streaming involves adding heartbeat messages, timeouts, and recovery, so many developers opt for polling instead, breaking the single request into two.

Both approaches require changing the client and the server code, and if another method needs breaking up you have to change all of the code again.

REST offers a different approach. We return a response that describes how to fetch another request, much like a HTTP redirect. In a client library, you could imagine handling these responses, much like an HTTP client handles redirects too.

code:
def long_running_call(self, args: str) -> bool:
    key = third_party.start_process(args)
    return Future("MyService.wait_for_long_running_call", {"key":key})

def wait_for_long_running_call(self, key: str) -> bool:
    if not third_party.wait(key):
        return third_party.is_success(key)
    else:
        return Future("MyService.wait_for_long_running_call", {"key":key})
code:
def fetch(request):
    response = make_api_call(request)
    while response.kind == 'Future':
        request = make_next_request(response.method_name, response.args)
        response = make_api_call(request)
    return response


For the more typed among you, please ignore the type annotations, pretend I called `raise Future(....)`, or imagine a `Result<bool>` that's a supertype. For the more operations minded, imagine I call `time.sleep()` inside the client, and maybe imagine the Future response has a duration inside.

The point is that by allowing a response to describe the next request in sequence, we've skipped over the problems of the other two approaches—we only need to implement the code once in the client. When a different method needs breaking up, you can return a `Future` and get on with your life.

In some ways it's as if you're returning a callback to the client, something the client knows how to run to produce a request. With `Future` objects, it's more like returning values for a template. This approach works for paginating too—breaking up a large response into smaller ones.

Pagination often looks something like this in an RPC system:

code:
cursor = rpc.open_cursor()
output = []
while cursor:
    output.append(cursor.values)
    cursor = rpc.move_cursor(cursor.id)
Or something like this:

code:
start = 0
output = []
while True:
    out = rpc.get_values(start, batch=30)
    output.append(out)
    start += len(out)
    if len(out) < 30:
        break
Iterating through a set of responses means keeping track of how far you've gotten. The first pagination example stores state on the server, and gives the client an id to use in subsequent requests. The second stores state on the client, and constructs the correct request to make from that state.

Like before, REST offers a third approach. The server can return a `Cursor` response, much like a `Future`, with a set of values and a request message to send for the next chunk, and the client pages through the responses to build a list of values:

code:
cursor = rpc.get_values()
output = []
while cursor:
    output.append(cursor.values)
    cursor = cursor.move_next()
code:
class ValueService(Service):
    @rpc()
    def get_values(self):
        return Cursor("ValueService.get_cursor", {"start":0, "batch":30}, [])

    @rpc()
    def get_cursor(self, start, batch):
        ...
        return Cursor("ValueService.get_cursor", {"start":start, "batch":batch}, values)
With the REST approach, the state is created on the server, sent back to the client, and then sent back to the server. If a server wants to, it can return a `Cursor` with a smaller set of values, and the client will just make more requests to get all of them.

`Future` and `Cursor` aren't the only kind we can parameterise—a `Service` can contain state to pass into methods, too.

To demonstrate why, imagine some worker that connects to a service, processes work, and uploads the results. The first attempt at server code might look like this:

code:
class WorkerApi(Service):
    def register_worker(self, name: str) -> str:
        ...
    def lock_queue(self, worker_id: str, queue_name: str) -> str:
        ...
    def take_from_queue(self, worker_id: str, queue_name, queue_lock: str):
        ...
    def upload_result(self, worker_id, queue_name, queue_lock, next, result):
        ...
    def unlock_queue(self, worker_id, queue_name, queue_lock):
        ...
    def exit_worker(self, worker_id):
        ...
Unfortunately, the client code looks much nastier:

code:
worker_id = rpc.register_worker(my_name)
lock = rpc.lock_queue(worker_id, queue_name)
while True:
    next = rpc.take_from_queue(worker_id, queue_name, lock)
    if next:
        result = process(next)
        rpc.upload_result(worker_id, queue_name, lock, next, result)
    else:
        break
rpc.unlock_queue(worker_id, queue_name, lock)
rpc.exit_worker(worker_id)
Each method requires a handful of parameters, relating to the current session open with the service. What we'd rather use is some API where the state between requests is handled for us:

code:
lease = rpc.register_worker(my_name)

queue = lease.lock_queue(queue_name)

while True:
    next = queue.take_next() 
    if next:
        next.upload_result(process(next))
    else:
        break
queue.unlock()
lease.expire()
The traditional way to achieve this is to build these wrappers by hand—creating special code on the client to wrap the responses, and call the right methods. If we can link together a large response, we should be able to link together the requests, and pass the state from one to the next just like a `Cursor` does.

Instead of one service, we now have four. Instead of returning identifiers to pass back in, we return a `Service` with those values filled in for us:

code:
class WorkerApi(Service):
    def register(self, worker_id):
        return Lease(worker_id)

class Lease(Service):
    worker_id: str

    @rpc()
    def lock_queue(self, name):
        ...
        return Queue(self.worker_id, name, lock)

class Queue(Service):
    name: str
    lock: str
    worker_id: str

    @rpc()
    def get_task(self):
        return Task(.., name, lock, worker_id)

class Task(Service):
    task_id: str
    worker_id: str

    @rpc()
    def upload(self, out):
        mark_done(self.task_id, self.actions, out)
The client code looks like the desired example above—instead of an id string, the client gets a 'Service' response, methods included, but with some state hidden inside. The client turns this into a normal service object, and when the methods get called, that state is added back into the request. You can even add new parameters in, without changing too much of the client code.

Although the `Future` looked like a callback, returning a `Service` feels like returning an object. This is the power of self-description—unlike reflection, where you specify in advance every request that can be made, each response has the opportunity to define what new requests can be made.

It's this navigation through several linked responses that distinguishes a regular command line tool from one that browses—and where REST gets its name.

The passing back and forth of requests from server to client is where the 'state-transfer' part of REST comes from, and using a common `Result` or `Cursor` object is where the 'representational' comes from, although a RESTful system is more than just these combined. Along with a reusable browser, you have reusable proxies.

In the same way that messages describe things to the client, they describe things to the proxy too. Using GET or POST, and distinct URLs, is what allows caches to work across services. Using a stateless protocol (HTTP) is what allows proxying to work so effortlessly. The trick with REST is that despite HTTP being stateless, and despite HTTP being simple, you can build complex, stateful services by threading the state invisibly between smaller messages.

Although the point of REST is to build a browser, the point is to use self-description and state-transfer to allow heavy amounts of interoperation—not just a reusable client, but reusable proxies, caches, or load balancers too. Going back to the constraints, you might be able to see how they fit together to achieve this.

Client-Server, Stateless, Caching, Uniformity, Layering and Code-on-Demand. The first, Client-Server, feels a little obvious, but sets the background. A server waits for requests from a client, and issues responses.

The second, Stateless, is a little more confusing. If an HTTP proxy had to keep track of how requests link together, it would involve a lot more memory and processing. The point of the stateless constraint is that to a proxy, each request stands alone. The point is also that any stateful interactions should be handled by linking messages together.

Caching is the third constraint, and it's back to being obvious. Requests and responses must have some description as to whether the request must be resent or the response can be reused. The fourth constraint, Uniformity, is the most difficult, so we'll cover it last. Layering is the fifth, and it means "You can proxy it".

Code-on-demand is the final, optional, and most overlooked constraint, but it covers the use of Cursors, Futures, or Parameterised Services—the idea that despite using a simple means to describe services or responses, they can be used, or run, to create new requests to send. Code-on-demand takes that further, and imagines passing back code, rather than templates and values to assemble.

With the other constraints handled, it's time for uniformity. Like statelessness, this constraint is more about HTTP than it is about the system atop, and frequently misapplied. This is the reason why people keep making database APIs and calling them RESTful, but the constraint has nothing to do with CRUD.

The constraint is broken down into four ideas: self-descriptive messages, identification of resources, manipulation of resources through representations, and hypermedia as the engine of application state. We'll take them one by one.

Self-Description is at the heart of REST, and this subconstraint fills in the gaps between the Layering, Caching, and Stateless constraints. Sort-of. It means using 'GET' and 'POST' to indicate to a proxy how to handle things, and responses indicate if they can be cached. It also means using a `content-type` header.

The next subconstraint, identification, means using different URLs for different services. In the RPC examples above, it means having a common, standard way to address a service or method, as well as its parameters. This ties into the next subconstraint, which is about using standard representations across services. This doesn't mean using a special format for every API request, but using the same underlying language to describe every response. The web works because everyone uses HTML.

Uniformity so far might as well mean use HTTP (self-description), URLs (identification) and HTML (manipulation through representations), but it's the last subconstraint that causes most of the headaches: hypermedia as the engine of application state.

This is a fancy way of talking about how large or long requests can be broken up into interlinked messages, or how a number of smaller requests can be threaded together, passing the state from one to the next. Hypermedia refers to using `Cursor`, `Future`, or `Service` objects; application state is the details passed around as hidden arguments; and being the 'engine' means using it to tie the whole system together.

Together they form the basis of the Representational State Transfer style. More than half of these constraints can be satisfied by just using HTTP—and if you dig into the thesis, you'll discover that the other half isn't about picking the right URLs, or using PUT or PATCH, but hiding those details from the end user.

REST at the end of the day is no more than a very long answer to explain what makes a web browser different—the Latin name for opening a link in a new tab. The state is the part of the application you want to open (like your inbox), the representation of the state is the URL in the link you're clicking on, and the transfer describes the whole back-and-forth of downloading the HTML with that URL in it, and then requesting it.

Representational state transfer is what makes the back button work, why you can refresh a broken page and not have to restart from the beginning, and why you can open links in a new tab. (Unless it's twitter, where breaking the back button is but a footnote in the list of faults)

If you now find yourself understanding REST, I'm sorry. You're now cursed. Like a cross between the Greek myths of Cassandra and Prometheus, you will be forced to explain the ideas over and over again to no avail. The terminology has been utterly destroyed to the point it has less meaning than 'Agile'.

Despite the well being thoroughly poisoned, these ideas of interoperability, self-description, and interlinked requests are surprisingly useful—you can break up large or slow responses, you can browse or even parameterise services, and you can do it in a way that lets you re-use tools across services.

I haven't covered everything—there are still a few more tricks. Although a RESTful system doesn't have to offer a database-like interface, it can. Along with `Service` or `Cursor`, you could imagine `Model` or `Rows` objects to return. For collection types, another trick is inlining.

Along with returning a request to make, a server can embed the result inside. A client can skip the network call and work directly on the inlined response. A server can even make this choice at runtime, opting to embed if the message is small enough.
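A sketch of what handling inlining might look like on the client (the `Cursor` shape here is invented to match the earlier examples):

```python
# Hypothetical Cursor response: a batch of values, plus either a
# next-request template or an inlined next chunk embedded by the server.
class Cursor:
    def __init__(self, values, next_request=None, inlined=None):
        self.values = values
        self.next_request = next_request
        self.inlined = inlined  # server may embed the next chunk directly

def collect(cursor, make_api_call):
    """Page through cursors, skipping the network when a chunk is inlined."""
    output = []
    while cursor is not None:
        output.extend(cursor.values)
        if cursor.inlined is not None:
            cursor = cursor.inlined            # no round trip needed
        elif cursor.next_request is not None:
            cursor = make_api_call(cursor.next_request)
        else:
            cursor = None
    return output
```

The client code is identical either way; the server decides at runtime whether embedding is worth it.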

Finally, with a RESTful system, you should be able to offer things in different encodings, depending on what the client asks for—even HTML. If you can build a reusable command line tool, generating a web interface isn't too difficult—at least this time you don't have to implement the browser from scratch.

In the end, being RESTful probably doesn't matter; your framework should take care of the details for you.

If interoperability, or common tools matter, then you might not care about the implementation details, but you should expect a little more from a RESTful system than just create, read, update and delete.

tef fucked around with this message at 09:21 on Jan 7, 2019

tef
May 30, 2004

-> some l-system crap ->

prisoner of waffles posted:

he's not trying to play this role but his shticks include having a really good understanding of several ideas that enough programmers think they can explain back to him despite not understanding, ergo periodically we get queuechat or RESTchat.

I'm an old man with anger issues, also suspicious dish is a jerk, but anyway, hopefully this fills in some of the gigantic holes in my earlier rushed attempt to condense a thesis

tef fucked around with this message at 07:03 on Jan 7, 2019

SAVE-LISP-AND-DIE
Nov 4, 2010
What's the least worst way of accepting docx files from users?

Edit: I mean, how can I safely edit untrusted docx files? Are macros going to gently caress my server up?

SAVE-LISP-AND-DIE fucked around with this message at 12:18 on Jan 7, 2019

DoctorTristan
Mar 11, 2006

I would look up into your lifeless eyes and wave, like this. Can you and your associates arrange that for me, Mr. Morden?
Open them with macros disabled? Do it in a VM if you really want to be careful?

Dren
Jan 5, 2001

Pillbug

strange posted:

What's the least worst way of accepting docx files from users?

Edit: I mean, how can I safely edit untrusted docx files? Are macros going to gently caress my server up?
If you care enough about being safe to spend $$$ there are products that can sanitize office files. Basically, rip apart the file and rebuild it with anything dangerous removed. (This may or may not ruin your file).

mystes
May 31, 2006

strange posted:

What's the least worst way of accepting docx files from users?

Edit: I mean, how can I safely edit untrusted docx files? Are macros going to gently caress my server up?
What are you doing with the files? Opening them with office is asking for trouble but just manipulating the OOXML data directly is probably relatively safe.

Volguus
Mar 3, 2009

strange posted:

What's the least worst way of accepting docx files from users?

Edit: I mean, how can I safely edit untrusted docx files? Are macros going to gently caress my server up?

If what you want is very simple (update the contents of the 5th <w:t> element in the second table with id 55), then using any XML library is fine. If you want a bit more than that, or don't know the structure of the document that precisely, then you should use a dedicated library. The dumber the library, the fewer features it has, the safer it is.
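Since a .docx is just a zip archive of XML parts, the standard library gets you surprisingly far; a minimal sketch of pulling the text out without ever opening Office (so macros never run):

```python
import zipfile
import xml.etree.ElementTree as ET

# The main document body lives in word/document.xml; <w:t> elements hold text.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def extract_text(path):
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    return [t.text or "" for t in root.iter(f"{W_NS}t")]
```

Editing works the same way in reverse: modify the tree, serialize it, and write a new zip. Nothing here executes document content, which is the point.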

Dominoes
Sep 20, 2007

Does anyone know of any resources that track financial news article performance? Doing a Google search for "financial news article track records" doesn't produce relevant results. A random scan of a few financial news articles at any point appears to produce a few vague, but quantifiable predictions over time spans ranging from a week to a year. I'm suspicious that all predictions, including ones from respected sources, are no better than random, since my understanding from statistics is that any better-than-random guess can be leveraged into large profits using derivatives. This includes price changes (or lack-thereof) of any kind.

I'm posting here instead of a finance thread, since I suspect the replies will be less biased. May crosspost later.

DoctorTristan
Mar 11, 2006

I would look up into your lifeless eyes and wave, like this. Can you and your associates arrange that for me, Mr. Morden?

Dominoes posted:

Does anyone know of any resources that track financial news article performance? Doing a Google search for "financial news article track records" doesn't produce relevant results. A random scan of a few financial news articles at any point appears to produce a few vague, but quantifiable predictions over time spans ranging from a week to a year. I'm suspicious that all predictions, including ones from respected sources, are no better than random, since my understanding from statistics is that any better-than-random guess can be leveraged into large profits using derivatives. This includes price changes (or lack-thereof) of any kind.

I'm posting here instead of a finance thread, since I suspect the replies will be less biased. May crosspost later.

There are any number of (paid) services that do things like scrape news sources for company names and push out a feed with the time stamp, ticker and some kind of sentiment score - the usual suspects in the financial data world (Thomson Reuters, Bloomberg, Factset, Nasdaq...) all offer something like this.

If you’re asking whether someone has put together a comprehensive historical dataset of financial news stories, combined this with (expensive) intraday equities data and made the results freely available on the internet, then the answer is no, not that I’m aware of.

luchadornado
Oct 7, 2004

A boombox is not a toy!

DoctorTristan posted:

There are any number of (paid) services that do things like scrape news sources for company names and push out a feed with the time stamp, ticker and some kind of sentiment score - the usual suspects in the financial data world (Thomson Reuters, Bloomberg, Factset, Nasdaq...) all offer something like this.

If you’re asking whether someone has put together a comprehensive historical dataset of financial news stories, combined this with (expensive) intraday equities data and made the results freely available on the internet, then the answer is no, not that I’m aware of.

Having worked on a system that did exactly that - there were very few competitors in the field, and it would not be something you'd give away for free. It's Big Data territory and requires expensive licensing from third parties.

pokeyman
Nov 26, 2006

That elephant ate my entire platoon.

tef posted:

Here's a long post that tries to explain what the hell is REST, anyway.

This was interesting and well-written and I appreciated it, thanks!

RPATDO_LAMD
Mar 22, 2013

🐘🪠🍆

DoctorTristan posted:

There are any number of (paid) services that do things like scrape news sources for company names and push out a feed with the time stamp, ticker and some kind of sentiment score - the usual suspects in the financial data world (Thomson Reuters, Bloomberg, Factset, Nasdaq...) all offer something like this.

If you’re asking whether someone has put together a comprehensive historical dataset of financial news stories, combined this with (expensive) intraday equities data and made the results freely available on the internet, then the answer is no, not that I’m aware of.

It sounds like he's asking for a set of articles that make concrete measurable predictions, and then a dataset showing how accurate those predictions turned out to be.
Not a comprehensive list of every "Apple is good"/"apple is bad" blog alongside a graph of the apple stock price.

Dominoes
Sep 20, 2007

RPATDO_LAMD posted:

It sounds like he's asking for a set of articles that make concrete measurable predictions, and then a dataset showing how accurate those predictions turned out to be.
Not a comprehensive list of every "Apple is good"/"apple is bad" blog alongside a graph of the apple stock price.

Nailed it! It wouldn't be too difficult, but I don't think it would be worth the effort.

DoctorTristan
Mar 11, 2006

I would look up into your lifeless eyes and wave, like this. Can you and your associates arrange that for me, Mr. Morden?

Dominoes posted:

Nailed it! It wouldn't be too difficult,

Systematically identifying and extracting stock predictions from news articles is a non-trivial NLP problem. Even after that the question of ‘was it accurate?’ is not always obvious (Over what timeframe? Relative to what benchmark?)

Such a study would involve several months work by a skilled team, plus (probably) licensing a few proprietary software libraries and datasets. That is expensive and finance is not an industry people go into in order to give valuable work away for free.

redleader
Aug 18, 2005

Engage according to operational parameters
I'm starting to get the feeling that REST is like monads, but even fewer people actually understand what Fielding was getting at.

luchadornado
Oct 7, 2004

A boombox is not a toy!

redleader posted:

I'm starting to get the feeling that REST is like monads, but even fewer people actually understand what Fielding was getting at.

A monad is just a monoid in the category of endofunctors.

taqueso
Mar 8, 2004


:911:
:wookie: :thermidor: :wookie:
:dehumanize:

:pirate::hf::tinfoil:

How hard is it to use Google Firebase to make a phone app that can take a photo, read a QR code, allow the user to check some boxes/enter some text, then throw the pic + data into a database? This stuff sounds like magic; is it actually?

JawnV6
Jul 4, 2004

So hot ...
Why are y'all inventing an NLP problem out of whole cloth? Look at individual analysts' buy/sell/hold calls. Jim Cramer has a % accuracy that doesn't rely on fuzzy sentiment analysis.

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe
Have you considered that hyperbitcoinization will render securities denominated in legacy fiat currencies obsolete?

Dominoes
Sep 20, 2007

DoctorTristan posted:

Systematically identifying and extracting stock predictions from news articles is a non-trivial NLP problem. Even after that the question of ‘was it accurate?’ is not always obvious (Over what timeframe? Relative to what benchmark?)

Such a study would involve several months work by a skilled team, plus (probably) licensing a few proprietary software libraries and datasets. That is expensive and finance is not an industry people go into in order to give valuable work away for free.

Valid. I was thinking on a relatively small scale, but inferring any meaning would require lots of data. My initial thought was a fun (but long-term) project to demonstrate why financial news is BS.

pokeyman
Nov 26, 2006

That elephant ate my entire platoon.

taqueso posted:

How hard is it to use Google Firebase to make a phone app that can take a photo, read a qr code, allow the user to check some boxes/enter some text, then throw the pic + data into a database. This stuff sounds like magic, is it actually?

Not hard. It is not actually magic.

drainpipe
May 17, 2004

AAHHHHHHH!!!!
The vim thread has been archived, and I have just a short question. In the last month or so, I've been getting a weird issue. Sometimes, I'll press a button and the button's character will appear on screen but with a yellow background. It doesn't seem like a character that's actually in the file because I can't delete it (even if I try to delete the row). If I save, quit, and reload, it'll be gone. Anyone know what is going on? It doesn't seem like a serious issue, but it's still annoying as gently caress and I can't seem to reproduce it.

TooMuchAbstraction
Oct 14, 2012

I spent four years making
Waves of Steel
Hell yes I'm going to turn my avatar into an ad for it.
Fun Shoe
Do you have a modifier key pressed? vim may be saying "okay, you pressed meta-k, I'm waiting for the next character to decide what you actually meant to input". Maybe fire up a keyboard input visualizer (I don't know anything about that one, it just turned up in Google) and see if it shows any unexpected keys as being depressed.
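If the stray character sticks around long enough to inspect, vim itself can tell you whether it thinks anything is really there. A couple of standard built-in commands (nothing plugin-specific) worth trying:

```
ga         " normal mode, cursor on the character: print its decimal/hex/octal value
:ascii     " the same thing as an Ex command
:redraw!   " force a full screen redraw; pure rendering artifacts disappear
```

If `:redraw!` makes it vanish, it was screen corruption rather than file content, which would also fit the save-quit-reload behaviour described above.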

General_Failure
Apr 17, 2005
I need some advice. I have been working with part of a large source tree. Dealing with updates to it can be slow and clunky because I have to keep a clean tree for pushes and a working tree to test it. Besides object files etc. it sucks components of a commercial toolchain and a bunch of other stuff into itself during the initial preparation phase.

Ideally what I'd like to do is compare the trees to work out exactly where it's putting everything, then go hog wild with a shitton of .gitignore files so I don't need two drat trees and a heap of copypasted files. Or is this not done? I've been bumbling along for a couple of years like that and it sucks.

luchadornado
Oct 7, 2004

A boombox is not a toy!

General_Failure posted:

I need some advice. I have been working with part of a large source tree. Dealing with updates to it can be slow and clunky because I have to keep a clean tree for pushes and a working tree to test it. Besides object files etc. it sucks components of a commercial toolchain and a bunch of other stuff into itself during the initial preparation phase.

Ideally what I'd like to do is compare the trees to work out exactly where it's putting everything, then go hog wild with a shitton of .gitignore files so I don't need two drat trees and a heap of copypasted files. Or is this not done? I've been bumbling along for a couple of years like that and it sucks.

I have so many questions about this, but it sounds very snowflakish, which is not the path to repeatable, deterministic, and automatic builds and tests. Cover your code with unit and functional tests to give you sufficient confidence, and then have some sort of CI process with Jenkins or Drone or whatever run your integration tests with the "commercial toolchain and other stuff".

"I have to keep a clean tree for pushes and a working tree to test it" - I have no idea what constraints have led to this point, but this feels like the problem.

General_Failure
Apr 17, 2005

Helicity posted:

"I have to keep a clean tree for pushes and a working tree to test it" - I have no idea what constraints have led to this point, but this feels like the problem.

Hmm. I think "30 year old codebase" is the most succinct answer.

luchadornado
Oct 7, 2004

A boombox is not a toy!

General_Failure posted:

Hmm. I think "30 year old codebase" is the most succinct answer.

Yikes. It's not really a great answer, but if you want the benefits of a modern CI/CD workflow, you have to modernize your codebase.

code:
if propensity_to_scare_away_new_hires + increased_productivity + personal_sanity > any_other_value_you_could_add { modernize() }
You could write some bespoke tool to diff the DAGs, autogenerate .gitignore, and do a few other things to make your life easier, but you should be honest that you're just punting the problem to the next guy or to future you. Those kinds of hacks tend to accumulate and make it harder to modernize in the long run. Alternatively, bailing now and making it someone else's problem isn't the worst thing (I've done it before).
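The tree-diff part of that bespoke tool is small enough to sketch. Assuming the two checkouts live side by side (the directory names here are made up), something like this lists every file that exists only in the dirty tree, i.e. the .gitignore candidates:

```python
from pathlib import Path

def ignore_candidates(clean_root: str, dirty_root: str) -> list[str]:
    """Files present only under dirty_root: candidates for .gitignore."""
    clean = {p.relative_to(clean_root).as_posix()
             for p in Path(clean_root).rglob("*") if p.is_file()}
    dirty = {p.relative_to(dirty_root).as_posix()
             for p in Path(dirty_root).rglob("*") if p.is_file()}
    return sorted(dirty - clean)
```

You'd still want to collapse the output into directory- and extension-level patterns by hand, but it beats eyeballing a 1.2MB diff.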

asur
Dec 28, 2012

Helicity posted:

I have so many questions about this, but it sounds very snowflakish, which is not the path to repeatable, deterministic, and automatic builds and tests. Cover your code with unit and functional tests to give you sufficient confidence, and then have some sort of CI process with Jenkins or Drone or whatever run your integration tests with the "commercial toolchain and other stuff".

"I have to keep a clean tree for pushes and a working tree to test it" - I have no idea what constraints have led to this point, but this feels like the problem.

He doesn't have a .gitignore set up, so git tries to track all the files. The solution is to set up the file, and it's probably faster to just try a push, since that will automatically show the diff. The vast majority of files and folders will fall under a few generic rules, like ignoring the build folders.

I think you guys are making this much harder than it is. You don't need to autogenerate the file, and it should rarely need updating: .gitignore accepts glob patterns, and a few of them can cover 99.9% of the files, with a few one-offs for the rest.
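For the sake of illustration, a handful of generic patterns along those lines (the directory names are hypothetical) already covers most generated files:

```
# build output and object files
build/
*.o
*.a
*.so

# components pulled in by the prep phase (hypothetical path)
toolchain/

# editor and OS droppings
*~
.DS_Store
```

Directories get a trailing slash; everything else is a glob.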

asur fucked around with this message at 17:06 on Jan 14, 2019

luchadornado
Oct 7, 2004

A boombox is not a toy!

If they've been dealing with this for years on a 30-year-old chunk of code, I'm assuming it's beyond a simple .gitignore change. The onus is on General_Failure to prove otherwise, I guess.

General_Failure
Apr 17, 2005
The only thing I'm using their main tree for is pulling the components I need. It's a CVS tree :(

My port works from my own tree, which is based off the source for a port to another platform that someone did, plus some other downloaded components. There's more to it than that, but as explanations go it'll suffice. I'm using Git.

Last night I had some realisations. I did a couple of diffs between my clean and dirty trees, dumping them to file. The brief one was about 600k and the full one was about 1.2MB. Two trees it is. I also realised something else: I recently sorted out the kinks of shifting from SVN to Git, and now I can have my remote git repo for the clean tree and have the dirty tree use the local clean tree as its repo. Push from the clean tree, pull to the dirty tree.
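That two-clone arrangement is easy to script. A sketch with hypothetical paths, where the clean clone tracks the GitLab remote and the dirty clone's origin is the local clean tree:

```shell
# clean tree: tracks the real remote and stays pristine for pushes
git clone git@gitlab.com:you/project.git clean

# dirty tree: a local clone whose 'origin' is the clean tree
git clone ./clean dirty

# publish from the clean tree as usual
(cd clean && git push origin master)

# refresh the dirty tree without disturbing its extra junk files
(cd dirty && git pull origin master)
```

One detail worth knowing: a git clone of a local path hardlinks the object store by default, which is harmless here and keeps the second tree cheap.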

I'm not doing this for money btw. It's open source code. I'm just doing it to stop my brain from rotting really. Right now I'm rebasing my code because my old tree got damaged through various transitions. Plus it's a couple of years out of date and the licensing model changed. It seems I'm only one finicky module away from having a uBoot image build again. Probably won't boot because I know I did something weird before I had to stop last year.

btw, it may have some ancient and creaky underpinnings but there's at least 500MB of source in any platform build tree. I hate to think how big the whole CVS tree is.

I can't link to my current tree because it's private on GitLab, so I can hide my shame. I need to find some files I rewrote but never got to integrate, and remove a mass of commented-out, unneeded/failed code.


His Divine Shadow
Aug 7, 2000

I'm not a fascist. I'm a priest. Fascists dress up in black and tell people what to do.
Has anyone here ever run a reverse proxy on IIS for a Wordpress/Linux site? That's what I'm doing right now, and I'm having some weird problems with AJAX POST requests: they just result in a reset connection, so certain parts of the Wordpress site (the admin pages) aren't loading.

Not sure, maybe I posted this in the wrong place. I'm at that stage where my head feels like mush...
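For comparison, a minimal IIS URL Rewrite reverse-proxy rule usually looks like the sketch below (the backend hostname is a placeholder, not your actual config). If your rule matches this shape, the next things worth checking are the ARR proxy timeout and the request-filtering body size limit, since limits like those tend to bite POSTs with bodies while plain GETs sail through:

```xml
<system.webServer>
  <rewrite>
    <rules>
      <rule name="WordpressProxy" stopProcessing="true">
        <match url="(.*)" />
        <action type="Rewrite" url="http://wordpress-backend/{R:1}" />
      </rule>
    </rules>
  </rewrite>
</system.webServer>
```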
