BigRedDot
Mar 6, 2008

Dominoes posted:

Allow for Python function annotations.

Method one: Ignore them, like the Python interpreter does, instead of throwing an error. Should be easy to implement.

Method two: Use them as an alternative syntax for function signatures, like with mypy.


I think the main issue is that mypy's type annotations are not generic or extensible enough (currently) to accommodate all the cases numba handles, so there is not a lot of push in that direction. Maybe someday, as a subset, if there is enough demand for it.


Dominoes
Sep 20, 2007

SurgicalOntologist posted:

I just wanted to point out that it is possible to compute variance in a single pass.
That wikipedia function for single-pass variance, with an added numba decorator, works twice as fast as the two-pass method I was using. (Pass 1 is finding the mean.)
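For reference, a minimal sketch of that single-pass (Welford) update — the function and variable names here are illustrative, not the exact code from the article:

```python
def online_variance(data):
    # Welford's single-pass algorithm: update the running mean and the
    # running sum of squared deviations (m2) as each value arrives.
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (n - 1) if n > 1 else 0.0  # sample variance
```

With numba, the same function just gets a @numba.jit decorator on top; nothing else changes.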

Dominoes
Sep 20, 2007

BigRedDot posted:

I think the main issue is that mypy's type annotations are not generic or extensible enough (currently) to accommodate all the cases numba handles, so there is not a lot of push in that direction. Maybe someday, as a subset, if there is enough demand for it.
Should still be able to ignore them. (Not just mypy - any valid Python func annotation).

Another q - might be more of a C / memory allocation question. What's the best way to trim empty values from an array?

I'm making a substitute for np.nonzero:

Python code:
    result_i = 0
    result = np.empty(M, dtype=np.int)
    for i in range(M):
        if diffs[i] != 0:
            result[result_i] = i
            result_i += 1
    return result
I'm setting up the array to its maximum size. I get a valid (and faster) result than np.nonzero, but it has a bunch of 0s and arbitrary numbers hanging off the end. Is there a fast way to trim it, or a better way to approach? Initializing the array with 0s and using np.trim_zeros is far too slow. I don't expect much/any speed increase over numpy's func, but I'm curious academically.

I tried this:
Python code:
    result_i = 0  # index, and also serves as size for new array.
    result = np.empty(M, dtype=np.int)
    for i in range(M):
        if diffs[i] != 0:
            result[result_i] = i
            result_i += 1

    result_trimmed = np.empty(result_i, dtype=np.int)
    for i in range(result_i):
        result_trimmed[i] = result[i]

    return result_trimmed
, but it ends up taking longer than np.nonzero.

Related question to Dot: Do you know why using np.empty, np.zeros etc will cause Numba to fail when using nopython=True?

Python code:
@numba.jit(nopython=True)
def test():
    np.empty(50)
    return 2


...

UntypedAttributeError: Failed at nopython (nopython frontend)
Unknown attribute "empty" of type Module(<module 'numpy' from 
'C:\\Users\\David\\Anaconda3\\lib\\site-packages\\numpy\\__init__.py'>)


Dominoes fucked around with this message at 17:33 on Mar 23, 2015

salisbury shake
Dec 27, 2011

jimcunningham posted:

Still a programming newbie. Would like some help with enumerate(). When I try to learn stuff on my own, it's hard to follow because of the different variables they are using.
I'm reading a txt file into a list. I need to number each line. Then print even numbered lines. I made it work like this, but I have a feeling that enumerate works better. (I discovered it afterwards)
code:
or_file = open("test5.txt", 'r')
outtext = or_file.readlines()
or_file.close()
f_text = list()
x = 1
newtxt = list()

for line in outtext:
	y = str(x)
	newtxt.append(y + " " + line)
	x = x + 1

for line in newtxt:
	f = int(line[0])
	
	if (f % 2 == 0):
		f_text.append(line[:])
		
	else: continue
	
print f_text

enumerate returns an iterable of tuples of the format -> (0, item1), (1, item2), (2, item3) and so on
Python has a feature called tuple unpacking that allows us to do things like this

Python code:
x, y = (1, 2)

for x, y in ((1,2), (3,4)):
    print(x, y)
Here's how you use enumerate:
Python code:
or_file = open("test5.txt", 'r')
outtext = or_file.readlines()
or_file.close()
f_text = list()
x = 1
newtxt = list()

for line in outtext:
	y = str(x)
	newtxt.append(y + " " + line)
	x = x + 1

for index, line in enumerate(newtxt, 1):	# start counting at 1 to match the line numbers
	if index % 2 == 0:
		f_text.append(line[:])
	
print f_text
What you have is fine, it works. You can refactor it a little bit based on an understanding of what's happening behind the scenes. You shouldn't be expected to know any of this, so don't get overwhelmed or discouraged. I'm bored, have a cold and like to post, so here's words.

We can rewrite this using one loop so we aren't loading your file into memory and then iterating over it. We can get a handle on the file, and then read a line of the file into memory, do some operations and read the next line into memory. If we do it that way, only one "line" of memory is used, we don't have to wait for it to be loaded completely and we aren't iterating over the contents 3 times (.readlines() pulls in every line of the file).

This interleaved fashion of lockstep read/operation/read can be faster than doing a single operation that will cause the whole program to wait until it's over to continue executing. Generators are a feature in Python that take advantage of this, if you're interested.

Python code:
f_text = list()

with open('text5.txt', 'r') as or_file:     # with blocks ensure file.close() is always called
    for index, line in enumerate(or_file, 1):   # start at 1 so "even" matches the line numbers
        if index % 2 == 0:
            f_text.append(str(index) + " " + line)

# or we can use a list comprehension, whatever floats your boat
with open('text5.txt', 'r') as or_file:
    f_text = [str(index) + " " + line
              for index, line in enumerate(or_file, 1)
              if index % 2 == 0]

print f_text

quote:

I can probably get rid of 10 lines or more by doing 1 crazy trick. and I bet it's simple and everyone is laughing at me.

Informative verbosity and explicitness are ideals that are given some reverence in the Python community. One-liners can be laughable if they don't convey the author's intent in an easy-to-understand manner.

Don't overthink it or try to be clever. People who have to touch your code later will not like you.

Anyway, have fun learning and come back if you have more questions.

salisbury shake fucked around with this message at 17:54 on Mar 23, 2015

QuarkJets
Sep 8, 2008

Dominoes posted:

Should still be able to ignore them. (Not just mypy - any valid Python func annotation).

Another q - might be more of a C / memory allocation question. What's the best way to trim empty values from an array?

I'm making a substitute for np.nonzero:

Python code:
    result_i = 0
    result = np.empty(M, dtype=np.int)
    for i in range(M):
        if diffs[i] != 0:
            result[result_i] = i
            result_i += 1
    return result
I'm setting up the array to its maximum size. I get a valid (and faster) result than np.nonzero, but it has a bunch of 0s and arbitrary numbers hanging off the end. Is there a fast way to trim it, or a better way to approach? Initializing the array with 0s and using np.trim_zeros is far too slow. I don't expect much/any speed increase over numpy's func, but I'm curious academically.

I tried this:
Python code:
    result_i = 0  # index, and also serves as size for new array.
    result = np.empty(M, dtype=np.int)
    for i in range(M):
        if diffs[i] != 0:
            result[result_i] = i
            result_i += 1

    result_trimmed = np.empty(result_i, dtype=np.int)
    for i in range(result_i):
        result_trimmed[i] = result[i]

    return result_trimmed
, but it ends up taking longer than np.nonzero.

Related question to Dot: Do you know why using np.empty, np.zeros etc will cause Numba to fail when using nopython=True?

Python code:
@numba.jit(nopython=True)
def test():
    np.empty(50)
    return 2


...

UntypedAttributeError: Failed at nopython (nopython frontend)
Unknown attribute "empty" of type Module(<module 'numpy' from 
'C:\\Users\\David\\Anaconda3\\lib\\site-packages\\numpy\\__init__.py'>)



Numpy array elements are contiguous in memory, so you might be able to quickly reshape it somehow in a way that Numba likes. Have you tried simply slicing into the "result" array and returning that? I'm assuming that it will allocate a new array, which is really not what you want, but maybe numba is smart enough to handle slices or reshapes in a way that doesn't allocate a new array.

You could also consider an alternative function that modifies the input array in-place, if it's an integer array. That saves you one array's worth of memory allocation, which can be fine depending on what you're doing and will definitely be faster. Might not be worth the effort, though
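A sketch of the slice-based version being suggested (the decorator and dtype are assumptions, and it is shown here without numba so it runs anywhere):

```python
import numpy as np

def nonzero_indices(diffs):
    # Fill the oversized buffer from the front, then return a slice of
    # the filled prefix instead of copying into a second array.
    # Under numba this would carry @numba.jit(nopython=True).
    M = diffs.shape[0]
    result = np.empty(M, dtype=np.int64)
    result_i = 0
    for i in range(M):
        if diffs[i] != 0:
            result[result_i] = i
            result_i += 1
    return result[:result_i]  # a view; no second copy loop
```

The slice is a view into the original buffer, so no second allocation or copy loop happens at the Python level.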

accipter
Sep 12, 2003

Dominoes posted:

That wikipedia function for single-pass variance, with an added numba decorator, works twice as fast as the two-pass method I was using. (Pass 1 is finding the mean.)

The next two moments can also be included in one pass.

Dominoes
Sep 20, 2007

accipter posted:

The next two moments can also be included in one pass.
Cool!

Cingulate
Oct 23, 2012

by Fluffdaddy
Dominoes, I don't think one could do something similar for e.g. the linear solvers in Numpy though? (Compared to the Intel MKL build of Numpy.) As, probably, they already run as compiled C or Fortran code anyway.

Dominoes
Sep 20, 2007

No idea, but you reminded me I've been doing the speed tests against non-MKL numpy. Signed up for an Anaconda MKL trial.

Faster GLM/RLM would be nice though; coincidentally it's the limfac in one of my projects.


Quark - looks like array slicing works.

Dominoes fucked around with this message at 21:24 on Mar 23, 2015

Nippashish
Nov 2, 2005

Let me see you dance!
It would be useful to have a really fast argmax function that can operate over batches. If I have a rank-k tensor I'd like to be able to take the argmax over the contiguous dimension really, really quickly (preferably using multiple cores). Even with MKL, numpy still only uses one core for argmax, and this operation has been my bottleneck many times.
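For what it's worth, numpy's argmax already takes an axis argument for the batched case — the complaint above is about speed (one core), not the API:

```python
import numpy as np

# Batched argmax over the last (contiguous) dimension of a rank-3 array.
x = np.arange(24).reshape(2, 3, 4)
idx = np.argmax(x, axis=-1)
# One index per (i, j) slice; values increase along the last axis here,
# so every argmax lands on index 3.
```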

Cingulate
Oct 23, 2012


Dominoes posted:

Faster GLM/RLM would be nice though; coincidentally it's the limfac in one of my projects.
Have you looked at MATLAB's mldivide?

QuarkJets
Sep 8, 2008

Cingulate posted:

Have you looked at MATLAB's mldivide?

This is the first time that I've ever heard someone suggest MATLAB in response to "I need this to run faster"

Harriet Carker
Jun 2, 2009

I have another probably D'oh moment question. I have to write a program that sums a list without using sum, import, for, while, or a couple other keywords. I wrote a really simple recursive function:

code:
def checkio(data):
    if len(data) == 1:
        print(data[0])
        return data[0]
    data[-2] = data[-1] + data[-2]
    del data[-1]
    checkio(data)

print(checkio([1,2,3,4,5]))
The output is:

code:
15
None
I can't for the life of me understand why the return value is None when it correctly calculates and prints the sum I want to return inside the same if block! Please come to my rescue again, wonderful Python goons.

Dominoes
Sep 20, 2007

dantheman650 posted:

I can't for the life of me understand why the return value is None when it correctly calculates and prints the sum I want to return inside the same if block! Please come to my rescue again, wonderful Python goons.
Prepend return to the function's last line.
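That is, the recursive call's result has to be passed back up, or the outer call falls off the end of the function and returns None:

```python
def checkio(data):
    if len(data) == 1:
        return data[0]
    data[-2] = data[-1] + data[-2]
    del data[-1]
    return checkio(data)  # the previously missing return

print(checkio([1, 2, 3, 4, 5]))
```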

Is there a reason why python doesn't implement numpy/matlab-style indexing with multiple keys or indices?

Ie for a list of dicts:
Python code:
adict[3, 'akey']
As a cleaner alternative to
Python code:
adict[3]['akey']

Dominoes fucked around with this message at 07:41 on Mar 26, 2015

Harriet Carker
Jun 2, 2009

Dominoes posted:

Prepend return to the function's last line.

:doh:
Me do recursive functions good.

Nippashish
Nov 2, 2005


Dominoes posted:

Prepend return to the function's last line.

Is there a reason why python doesn't implement numpy/matlab-style indexing with multiple keys or indices?

Ie for a list of dicts:
Python code:
adict[3, 'akey']
As a cleaner alternative to
Python code:
adict[3]['akey']

Matlab doesn't do that for nested structures (e.g. you can't index a cell array of matrices like that). Numpy does do that for multidimensional arrays (e.g. my_array[1,2,3]). If you use a dictionary whose keys are tuples you can do the type of indexing you want, at the cost of using a completely different data structure.
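A quick sketch of that tuple-key trade-off (the data here is hypothetical, just for illustration):

```python
# A list of dicts needs chained indexing; a dict keyed by
# (index, key) tuples supports the adict[3, 'akey'] syntax directly.
nested = [{'akey': v} for v in 'wxyz']
flat = {(i, k): v for i, d in enumerate(nested) for k, v in d.items()}

assert nested[3]['akey'] == flat[3, 'akey'] == 'z'
```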

Cingulate
Oct 23, 2012


QuarkJets posted:

This is the first time that I've ever heard someone suggest MATLAB in response to "I need this to run faster"
I do that occasionally - of course, many of MATLAB's basic capabilities are state of the art, top of the line.

Though if anybody knows of a linear system solver faster than MATLAB's mldivide, please, please do tell me.

Cingulate
Oct 23, 2012


Dominoes posted:

Prepend return to the function's last line.

Is there a reason why python doesn't implement numpy/matlab-style indexing with multiple keys or indices?

Ie for a list of dicts:
Python code:
adict[3, 'akey']
As a cleaner alternative to
Python code:
adict[3]['akey']
The answer is Pandas.

Buffis
Apr 29, 2006

I paid for this
Fallen Rib

Dominoes posted:

Is there a reason why python doesn't implement numpy/matlab-style indexing with multiple keys or indices?

Ie for a list of dicts:
Python code:
adict[3, 'akey']
As a cleaner alternative to
Python code:
adict[3]['akey']

Tuples are (usually) allowed to omit the parentheses, as in:
>>> a={}
>>> a[1, "hey"]=2
>>> a
{(1, 'hey'): 2}
>>> a[1, "hey"]
2

So the syntax mentioned above implies that a tuple is being looked up in a dict.

Buffis
Apr 29, 2006

Comedy answer:

code:
class LULZ(dict):
  "Don't do this"
  def __getitem__(self, plzno):
    if type(plzno) == tuple: dict.__setitem__(self, *plzno)
    else: return dict.__getitem__(self, plzno)

d=LULZ()
d[3, 'akey']
print d[3] # outputs 'akey'

QuarkJets
Sep 8, 2008

Cingulate posted:

of course, many of MATLAB's basic capabilities are state of the art, top of the line.

This is a dirty lie (unless you're choosing a very narrow definition of "basic") and you shouldn't spread it

Cingulate
Oct 23, 2012


QuarkJets posted:

This is a dirty lie (unless you're choosing a very narrow definition of "basic") and you shouldn't spread it
mldivide is a very basic capacity of MATLAB - it's a single-character operator, more basic than element-wise matrix operations! - and for all I know, state of the art, top of the line. I don't know what you're talking about.

SurgicalOntologist
Jun 17, 2004

Well, in the end the hard work is done by LAPACK or MKL or some other low-level implementation. So, it is probably not appreciably different in efficiency from any other high-level language which also calls out to those libraries.

http://stackoverflow.com/a/18553768
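E.g. the dense square case that mldivide handles with an LAPACK LU solve is one call away in numpy (the matrix here is made up for illustration):

```python
import numpy as np

# Solve Ax = b; np.linalg.solve dispatches to LAPACK's gesv routines,
# the same family of LU-based solvers MATLAB's backslash uses for a
# general dense square system.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = np.linalg.solve(A, b)
```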

Cingulate
Oct 23, 2012

Yes, basically mldivide is seemingly very good at picking what specific package to call (e.g., suitesparse/UMFPACK), and of course Mathworks pays the cash for licensing the MKL and so on.

So, it's state of the art.

ButtWolf
Dec 30, 2004

by Jeffrey of YOSPOS
moving this question to python thread

code:
def calc(dna_str):
	
	for x in dna_str:
		gcp = ( (dna_str.count('G') + dna_str.count('C')) / (len(dna_str) ) * 100)
		#I didn't know about item.count('G') so I had 15 lines to do this.
		
	return gcp

with open('dna_id.txt', 'r') as ros:    <<< says the problem is here on 19
    rosa = {line[1:].split(' ')[0]:calc(line.split(' ')[1].strip()) for line in ros if line.strip()} 
    #this line is ugly and I had help putting it together anyway so this is not entirely my fault
    #is there a good way to do this?
    #FILE
    #>Rosalind_0498
    #ACTGCTGACTGACTGACTGACTG
    #>Rosalind_2840
    #ATGCATGTTTACGACTACGTACTGCCGCGCGCC
    #etc...

    #taking Rosalind_#### into the key in dict dna_str
    #value for each key is gcp from function

#do not delete
top_key = max(rosa, key=rosa.get)
print(top_key, rosa.get(top_key))
IndexError: list index out of range. I'll fix the function too so it's less gross. Part of my problem is that I'm trying to do something, and I don't have the skill or knowledge yet to do some things. But I'm frankensteining a program using parts I know and then looking up how to do stuff without an actual understanding of it.

I found the problem, I think, but still don't know what to do with it. If I go into the txt file and make the DNA sequence start and end on the same line as the ID, it works. I have no idea how long each string is going to be, though, so I can't program that in advance.
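For what it's worth, the calc function above doesn't need its loop at all — count() already scans the whole string, so the loop just recomputes the same value len(dna_str) times. A minimal cleanup:

```python
def calc(dna_str):
    # GC percentage of a DNA string, computed once (no loop needed).
    gc = dna_str.count('G') + dna_str.count('C')
    return gc / len(dna_str) * 100
```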

ButtWolf fucked around with this message at 02:47 on Mar 27, 2015

QuarkJets
Sep 8, 2008

Cingulate posted:

mldivide is a very basic capacity of MATLAB - it's a single-character operator, more basic than element-wise matrix operations! - and for all I know, state of the art, top of the line. I don't know what you're talking about.

I'm talking about all of the other "basic features" of MATLAB that are notoriously slow and archaic as gently caress. The features of Matlab that were inherited from older projects work great. Basically, any feature that can't be vectorized with a function written in Fortran before 1990 is complete garbage. Mathworks can't even provide basic list functionality without making it an O(n) operation, for gently caress's sake

By your logic, Fortran77 is a state of the art programming language

KICK BAMA KICK
Mar 2, 2009

Is there any whitespace in the input file other than the newlines? If not, why all the split()ing and .strip()ing? As I understand it, what you intend to do is take lines in groups of two, the first line of each pair being the header from which you need to extract that four-digit (they're always four digits?) ID and the second being the DNA string to process. I really can't make heads or tails of what's happening in that comprehension to tell you exactly what the IndexError is but I'm pretty sure at least one of the problems is just because you're processing each line of the file rather than considering headers and DNA strings differently. Here's the approach I'd take:
  • Read the file into memory with .readlines() so we can access its contents as a list of lines
  • Don't build a dict unless you need to do more calculations -- just set variables like max_score and id_of_max and keep track as we iterate over the file
  • Iterate over indices stepping by twos to consider a pair of lines at a time, i.e., for i in range(0, len(lines), 2):; then line[i] is the header, from which you can easily extract the last four characters for the id and line[i+1] is the DNA string
  • Calculate the score of each DNA string and update your max variables if necessary; at the end of the iteration, those hold your answers

ButtWolf
Dec 30, 2004


KICK BAMA KICK posted:

Is there any whitespace in the input file other than the newlines? If not, why all the split()ing and .strip()ing? As I understand it, what you intend to do is take lines in groups of two, the first line of each pair being the header from which you need to extract that four-digit (they're always four digits?) ID and the second being the DNA string to process. I really can't make heads or tails of what's happening in that comprehension to tell you exactly what the IndexError is but I'm pretty sure at least one of the problems is just because you're processing each line of the file rather than considering headers and DNA strings differently. Here's the approach I'd take:
  • Read the file into memory with .readlines() so we can access its contents as a list of lines
  • Don't build a dict unless you need to do more calculations -- just set variables like max_score and id_of_max and keep track as we iterate over the file
  • Iterate over indices stepping by twos to consider a pair of lines at a time, i.e., for i in range(0, len(lines), 2):; then line[i] is the header, from which you can easily extract the last four characters for the id and line[i+1] is the DNA string
  • Calculate the score of each DNA string and update your max variables if necessary; at the end of the iteration, those hold your answers

No whitespace. Like I said, I didn't write that line; I was having trouble shooting poo poo into the dict. Got help on SO.
strip really doesn't need to be in there, but I just found out that the lines were being read in a different format, so strip was useful.

the key needs to be after > so Rosalind_3092. Always 4 numbers.
I did kind of what you are saying to do without a dict, but still couldn't do it. I thought this would be easier. I was wrong. I'm gonna learn more basics before I do anything else.

KICK BAMA KICK
Mar 2, 2009

Here's my code if it helps for illustration; didn't test any of this so I'm just going by how you described the input.
code:
# If all you need is the maximum-score line, I would do this the classic way
# Just keeping track with two variables as we go
max_id = ''
max_score = 0

with open('dna_id.txt', 'r') as in_file:
    # I'm sure there's a way to do this without reading it all in at once
    # but let's go ahead for easy access by indices, as you'll see below
    ros = in_file.readlines()

for i in range(0, len(ros), 2):  # Iterate over the indices by twos, starting at zero
    header, dna_str = ros[i], ros[i+1]  # ros[0] is a header, ros[1] DNA; ros[2] a header, ros[3] DNA, etc.
    dna_id = header[-4:]  # A slice containing the last four characters
    score = calc(dna_str)
    if score > max_score:
        max_score = score
        max_id = dna_id
        
print(max_id, max_score)
You might benefit from something like a Codecademy tutorial that just focuses on Python syntax and the standard library rather than problem-solving, but don't worry, this is kinda how it goes for everyone learning on their own.

ButtWolf
Dec 30, 2004

out of range. I'm going back to manual labor. Got it. Ahh christ.

ButtWolf fucked around with this message at 03:16 on Mar 27, 2015

Cingulate
Oct 23, 2012


QuarkJets posted:

I'm talking about all of the other "basic features" of MATLAB that are notoriously slow and archaic as gently caress. The features of Matlab that were inherited from older projects work great. Basically, any feature that can't be vectorized with a function written in Fortran before 1990 is complete garbage. Mathworks can't even provide basic list functionality without making it an O(n) operation, for gently caress's sake

By your logic, Fortran77 is a state of the art programming language
I think you're trying to disprove something quite different from what I stated, such as "MATLAB is a good general programming language" or something like that.

mldivide is a basic MATLAB functionality, and it is state of the art, top of the line. I assume the same goes for e.g. dot products or matrix inversions. Thus, many of MATLAB's basic capabilities are state of the art, top of the line.

I've never actually compared the numpy/scipy versions to MATLAB; I'd assume with default installations, they're somewhat to noticeably slower.

baka kaba
Jul 19, 2003

PLEASE ASK ME, THE SELF-PROFESSED NO #1 PAUL CATTERMOLE FAN IN THE SOMETHING AWFUL S-CLUB 7 MEGATHREAD, TO NAME A SINGLE SONG BY HIS EXCELLENT NU-METAL SIDE PROJECT, SKUA, AND IF I CAN'T PLEASE TELL ME TO
EAT SHIT

jimcunningham posted:

No whitespace. Like i said, I didn't write that line I was having trouble shooting poo poo into the dict. got help on SO.
strip really doesn't need to be in there, but I just found out that they were seeing it as a different format so strip was useful.

the key needs to be after > so Rosalind_3092. Always 4 numbers.
I did kind of what you are saying to do without a dict, but still couldn't do it. I thought this would be easier. I was wrong. I'm gonna learn more basics before I do anything else.

What exactly are you trying to do? Like specifically, what are you working with and what do you need to do with it?
You're processing a set of data here, so the first thing you need to do is work out what the data does or could look like. Then you need to work out how to pull out the separate bits of data, handling any possible variations and skipping junk lines, so each time you're starting fresh in the right place to read the next chunk of data. Once you have that sorted out, you just need to worry about handling the data you've pulled out - storing it or processing it, whatever.

code:
>Rosalind_0498
ACTGCTGACTGACTGACTGACTG
>Rosalind_2840
ATGCATGTTTACGACTACGTACTGCCGCGCGCC
etc...
So looking at this, it seems like your data is organised in groups of two lines? As in, there's a line break after the header, it's not a single line separated with a space and you're looking at it with word wrap enabled? And doing
Python code:
with open('dna_id.txt') as ros:
    print(ros.readlines())
prints a list where the header and the DNA string are in separate elements?
['>Rosalind_0498\n', 'ACTGCTGACTGACTGACTGACTG\n', '>Rosalind_2840\n', 'ATGCATGTTTACGACTACGTACTGCCGCGCGCC']

If that's the case, on each iteration you need to read in the first line, strip the newline and the leading text, and you have your header ID. Then you need to read in the second line and strip the newline. Once you have those clean pieces of data you can run your processing and wang them in a dictionary or whatever. So something like

Python code:
with open('dna_id.txt') as ros:
	header = ros.readline().strip()
	header = header.split('_')[1]
	dna = ros.readline().strip()
	print("{}: {}".format(header, dna))
Anyway if you can guarantee that lines will always be in pairs, with no blank lines inbetween, you can make a loop that tries to read in a header line, and if it gets one, it reads the next dna line in and handles them both. If it gets a blank string instead (which signifies the end of the file) it stops looping.
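A sketch of that loop (the function name is made up, and it assumes clean header/DNA pairs as described):

```python
def read_pairs(path):
    # Read header/DNA line pairs until readline() returns '' (end of file).
    pairs = {}
    with open(path) as ros:
        while True:
            header = ros.readline().strip()
            if not header:          # '' signals the end of the file
                break
            dna = ros.readline().strip()
            pairs[header.split('_')[1]] = dna
    return pairs
```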



But if you can put both bits of data on the same line, separated by a space, it makes things a little cleaner since you can just read a line at a time:

Python code:
rosa = dict()
with open('dna_id.txt') as ros:
    for line in ros:
        header, dna = line.rsplit()
        header = header.split('_')[1]
        rosa[header] = calc(dna)
Just iterating over the file like that strips the newlines, so all you need to do is break it into two halves by splitting on the space, and then strip off the stuff before the underscore in the header. (This is what your code is expecting by the way, that's why it only worked when you put them on the same line)

Whether this is enough really depends on your data - you might want to put checks in to make sure there aren't extra spaces, or underscores, or blank lines etc., things that will trip up the algorithm and require some error handling. Validation checking, basically. If it's a one-off script and you're sure your input file is ok then it might be enough. (Also you might want to make your dictionary keys ints when you add them?)

Oh yeah, I've been a little bit wordy - you could definitely cut down the number of lines if you wanted (like doing the header split in the mapping line) but I wanted it to be clear what's happening. Your SO line looks like code golf, i.e. doing stuff in the minimum number of characters just for the hell of it, no wonder you're having trouble understanding it

baka kaba fucked around with this message at 14:43 on Mar 27, 2015

ButtWolf
Dec 30, 2004

Yeah I finally got it. Thanks guys. In all honesty it may have been too much to handle, since my python class is just getting to dicts anyway. I know it can be done other ways; this way just made sense to my limited brain. Also some strings span 2 lines, some 5, so telling it to find a set number of \ns wasn't gonna work. It was eye-opening at least to know that I need to slow down.

baka kaba
Jul 19, 2003


Well you can still do other things - identify the start of a new section by the > character starting a line, or check if a line only contains characters from GACT, and so on. Lots of options, the answer is always It Depends. Honestly the dicts are probably the easy part, there's a bunch of tricky stuff in there, so don't get discouraged

The thing about this (and all programming) is that you need experience to be able to start from the very end and work backwards, like with your SO code. When you're solving a problem you need to be able to break it down into steps, and work out a system for handling those steps. When you're done and you have something that works, maybe then you'll want to slim it down and refine it, and maybe even make a one-liner. But you usually need to have gone through the experience of wrasslin' the problem, first. Otherwise you're just crossing your fingers and hoping, and when it doesn't work you don't know where to start looking

QuarkJets
Sep 8, 2008

Cingulate posted:

I think you're trying to disprove something quite different from what I stated, such as "MATLAB is a good general programming language" or something like that.

mldivide is a basic MATLAB functionality, and it is state of the art, top of the line. I assume the same goes for e.g. dot products or matrix inversions. Thus, many of MATLAB's basic capabilities are state of the art, top of the line.

And I agreed with you, MATLAB's features that were built pre-1990 in Fortran by someone other than Mathworks run really well, so long as you can completely vectorize the operation

SurgicalOntologist
Jun 17, 2004

As do all the other major scientific computing options, as they all are built on the same C/Fortran libraries. This is silly.

Nippashish
Nov 2, 2005

Matlab is very good at the things it was designed to do. It's also very bad at some other things that turn out to be quite useful, but that doesn't make it less good at the things it does well.

Cingulate
Oct 23, 2012


QuarkJets posted:

And I agreed with you, MATLAB's features that were built pre-1990 in Fortran by someone other than Mathworks run really well, so long as you can completely vectorize the operation
UMFPACK was released in 1994 and is in C though, and I assume the parts of mldivide that check if UMFPACK is appropriate are not written in pre-1990 Fortran either.
And I guess this must be my final contribution to this slightly silly derail.

QuarkJets
Sep 8, 2008

Cingulate posted:

UMFPACK was released in 1994 and is in C though, and I assume the parts of mldivide that check if UMFPACK is appropriate are not written in pre-1990 Fortran either.
And I guess this must be my final contribution to this slightly silly derail.

I'm being hyperbolic obviously; my point is that MATLAB is a lovely software platform that provides access to some really good Fortran and C libraries, and in a thread about Python it's inappropriate to point people to MATLAB for fast computation. These features are all based on the same libraries, and in my experience the Python implementation is either just as fast (MKL Numpy vs MKL Matlab functions) or much faster (code segments that Numba can just compile for you, or anything that runs on a GPU; the underlying CUDA functions are the same, but the MATLAB wrappers on some of them can be really slow)

Or in other words:

SurgicalOntologist posted:

Well, in the end the hard work is done by LAPACK or MKL or some other low-level implementation. So, it is probably not appreciably different in efficiency from any other high-level language which also calls out to those libraries.

http://stackoverflow.com/a/18553768


Cingulate
Oct 23, 2012

Hm. In that case, I'm going to make my suggestion to Dominoes more clear: if solving linear systems is a limiting factor in your stuff, maybe take a look at MATLAB's mldivide, which
1. is somewhat well documented (and its parameters for every call can be laid bare) - e.g. SO's link
2. makes calls to the best actual number crunchers, so you can learn what the best actual number crunchers are
3. is pretty good at deciding what number crunchers to call based on the properties of the input (sparse, square etc.)
