Stubear St. Pierre
Feb 22, 2006

hyphz posted:

Is there a good clear explanation of how backpropagation works and why?

Backprop is a computer sciencey version of the chain rule from elementary calculus. If you (or anyone who ever reads this) don't know elementary calculus, here's a crash course:

- The derivative: If you take the slope of a function over an infinitely small interval, it tells you the "instantaneous rate of change" of that function. You can tell whether the function is increasing, decreasing, or staying the same (at a maximum or minimum value) based on whether that value is positive, negative, or 0.
- The derivative of F(X) is written F'(X).
- In machine learning, you're trying to minimize a loss function. That's why you need a derivative--I know I'm at a minimum when my derivative has been negative for a while and then hits 0.
- If you have multiple variables, F(x, y, z, t, w, ...), the vector of derivatives with respect to each variable is called the "gradient." It tells us how much we're changing in a whole bunch of directions at once. (There's a tiny numeric sketch right after this list if you want to poke at it.)
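Here's what that looks like if you just hack it out in Python (a throwaway finite-difference sketch; the loss function is made up for illustration):

code:
# crude numerical derivative: slope over a tiny interval
def derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: (x - 3) ** 2   # made-up loss with its minimum at x = 3

print(derivative(f, 1.0))    # negative -> loss is decreasing, keep going
print(derivative(f, 5.0))    # positive -> loss is increasing, we overshot
print(derivative(f, 3.0))    # ~0 -> we're sitting at the minimum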

So here's backpropagation:

1) if I have some poo poo, and multiply that poo poo by a bunch of matrices, that's a composite function--x times a matrix H, times a matrix G, times a matrix F, etc. can be written as F(G(H(x))). This is what deep learning does. All deep learning actually is is a graph of tensors, with ops as the edges connecting those tensors.
1.5) let's assume we can calculate a derivative of a function on a computer pretty easily.
2) the chain rule: the derivative ("gradient", it's actually the Jacobian iirc but who cares) of a well-behaved* composite function F(G(H(X))) is F'(G(H(X))) * G'(H(X)) * H'(X). There's a very easy proof/derivation of this actually, but you can live a rich and fulfilling life without ever bothering to look it up or verify it. (There's a quick numeric check right below if you'd rather just see it work.)
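If you'd rather sanity-check the chain rule than prove it, here's a throwaway comparison against a brute-force finite difference (the scalar functions are made up for illustration):

code:
import math

# composite F(G(H(x))) with hand-written derivatives
H, dH = (lambda x: 3 * x),       (lambda x: 3.0)
G, dG = (lambda u: u ** 2),      (lambda u: 2 * u)
F, dF = (lambda v: math.sin(v)), (lambda v: math.cos(v))

def chain_rule(x):
    # F'(G(H(x))) * G'(H(x)) * H'(x)
    return dF(G(H(x))) * dG(H(x)) * dH(x)

x, eps = 0.7, 1e-6
numeric = (F(G(H(x + eps))) - F(G(H(x - eps)))) / (2 * eps)
print(chain_rule(x), numeric)   # should agree to several decimal places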

So, backpropagation:
- In deep learning, I'm doing ((x * H) * G) * F, where let's say our * operator boils down to matrix multiplication or something similar.
- I can rewrite that as F(G(H(x))).
- As I go through each step, I can write down a derivative, F', G', H', and set up placeholders for when I know what H(x) and G(H(x)) are.
- I can then go backwards when I'm done: start at the output with F'(G(H(x))), multiply by G'(H(x)) (plugging in the H(x) I already computed), then by H'(x), on and on and on.
- That's backpropagation: I finally get something that behaves like a derivative out of that, and that derivative will tell me if I'm heading in the right direction.

So if I calculate H'(?), G'(?), F'(?) as I go, where "?" is a placeholder, I can just shove the intermediate outputs from computing F(G(H(x))) (I'm computing them sequentially anyway) into those derivatives and "backpropagate" the gradients.
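A sketch of that cache-then-go-backwards bookkeeping, with the same kind of made-up scalar functions standing in for the matrix ops:

code:
import math

# made-up functions standing in for the layers
H, dH = (lambda x: 3 * x),       (lambda x: 3.0)
G, dG = (lambda u: u ** 2),      (lambda u: 2 * u)
F, dF = (lambda v: math.sin(v)), (lambda v: math.cos(v))

def forward_and_backward(x):
    # forward pass: compute and cache each intermediate output
    h = H(x)
    g = G(h)
    out = F(g)

    # backward pass: fill the placeholders F'(?), G'(?), H'(?) with the
    # cached values, starting from the output and accumulating as we go
    grad = 1.0        # d(out)/d(out)
    grad *= dF(g)     # now d(out)/dg
    grad *= dG(h)     # now d(out)/dh
    grad *= dH(x)     # now d(out)/dx
    return out, grad

print(forward_and_backward(0.7))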

In TensorFlow this is done by building a second, backwards graph before running your stuff; in PyTorch every tensor you create has a backward() method that the autograd engine calls (remember, everything in TF/PT is tensors--the whole notion of "layers" is syntactic sugar).
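For example, a minimal PyTorch sketch (the shapes and values are made up):

code:
import torch

x = torch.randn(4, 3)                       # some input
W = torch.randn(3, 2, requires_grad=True)   # a "layer" is just a tensor

loss = (x @ W).sum()   # forward pass: autograd records the ops as a graph
loss.backward()        # reverse pass: the chain rule applied op by op

print(W.grad.shape)    # torch.Size([3, 2]), gradient of loss w.r.t. W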

*"well-behaved" in this context means continuously differentiable on the interval you're interested in or something

It really is that simple!

Stubear St. Pierre
Feb 22, 2006

Yeah the actual human brain works basically nothing like a deep neural network (to the extent we know how it works at all). Neural networks and "activations" get their name from superficial similarities.

The branch of CS devoted to simulating brains is called "neuromorphic" computing; that's basically the extent of my knowledge on it, but there's a Wikipedia article about it: https://en.wikipedia.org/wiki/Neuromorphic_engineering

Stubear St. Pierre
Feb 22, 2006

Rahu posted:

I've been trying to learn some ML stuff lately and to that end I've been reading over Andrej Karpathy's nanoGPT.

I think I have a pretty good grasp on how it works but I'm curious about one specific bit. The training script loads a binary file full of 16-bit ints that represent the tokenized input. It has a block of code that looks like this

https://github.com/karpathy/nanoGPT/blob/7fe4a099ad2a4654f96a51c0736ecf347149c34c/train.py#L116

code:
data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')
ix = torch.randint(len(data) - block_size, (batch_size,))
x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])

What I'm curious about is: what is the purpose of doing `astype(np.int64)` here? The data is written out as 16 bit uints, then loaded as 16 bit uints, then converted to 64 bit ints when going from numpy to pytorch, and I just don't see what that achieves.

The forward method of their GPT model feeds that input through an nn.Embedding layer, which requires torch.long (int64) input. They do the conversion in the batching code because that will generally run on the CPU, or at the very least can be precomputed/queued, whereas a conversion further down inside the actual network, at the Embedding layer, would happen on the GPU.
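You can see the constraint pretty quickly if you poke at it; here's a toy sketch (not the nanoGPT code itself, the embedding sizes and token values are made up):

code:
import numpy as np
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=50304, embedding_dim=8)   # made-up sizes

tokens_u16 = np.array([1, 5, 42], dtype=np.uint16)   # how the .bin stores them

# torch.from_numpy(tokens_u16) would fail on most PyTorch versions (no uint16
# dtype), and nn.Embedding wants int64 indices anyway, so convert while the
# batch is still on the CPU:
idx = torch.from_numpy(tokens_u16.astype(np.int64))

print(emb(idx).shape)   # torch.Size([3, 8])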

Xun posted:

Anyone going to ICML? Im not presenting (rip) but I managed to get a travel grant for it anyway :shrug:

Honestly the major conferences are notorious for having a horseshit, arbitrary acceptance process; you really deserve respect for being able to get a travel grant these days
