Loop until the steady state of a complex data structure in Python

I have a more-or-less complex data structure (a list of dictionaries of sets) on which I perform a bunch of operations in a loop until the data structure reaches a steady state, i.e. doesn't change anymore. The number of iterations the calculation takes varies wildly depending on the input.

I'd like to know if there's an established way of forming a halting condition in this case. The best I could come up with is pickling the data structure, storing its md5 and checking whether it has changed from the previous iteration. Since this is more expensive than my operations, I only do it every 20 iterations, but it still feels wrong.

Is there a nicer or cheaper way to check for deep equality so that I know when to halt?


asked Nov 8 '11 at 14:11

Could you be more specific (if possible) about what structure it is? Maybe there's a shortcut.

@Blender: It's a tens to hundreds long list. Each element is a dictionary, keyed by the same five strings. Of their values, one is a string, which never changes throughout the computation, the other four are sets of integers, on which I perform set operations that are functions of the neighboring elements in the main list.

Why are you keying each dictionary by the same strings? Couldn't you just create a big dictionary and have each key be a list of lists (from each dictionary)?

@Blender: True, or I could use tuples instead of dicts and just refer to them with indices and I'll probably rewrite it that way. Right now it's just for readability. My problem would persist either way.

4 Answers

Take a look at python-deep. It should do what you want, and if it's not fast enough you can modify it yourself.

It also depends very much on how expensive the comparison is relative to one iteration of the calculation. Say one calculation iteration takes time c, one test takes time t, and the chance of termination is p; then the optimal testing frequency is:

(t * p) / c

That's assuming c < t; if that's not true, you should obviously check on every loop.

So, since you can dynamically track c and t and estimate p (possibly with adaptations if the code suspects the calculation is about to end), you can set your test frequency to an optimal value.

answered Nov 8, 11:18
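The quantities c and t above can be measured directly before choosing a test frequency. The following is only a minimal sketch: `average_seconds`, `one_iteration`, `one_test` and the `"ids"` key are hypothetical stand-ins invented for illustration, not part of the question's code, and estimating p is left out entirely.

```python
import copy
import time


def average_seconds(func, *args, repeats=5):
    """Average wall-clock seconds of func(*args) over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(*args)
    return (time.perf_counter() - start) / repeats


# Hypothetical stand-in for one calculation iteration (mutates in place).
def one_iteration(records):
    for prev, nxt in zip(records, records[1:]):
        prev["ids"] |= nxt["ids"]


# Hypothetical stand-in for one steady-state test.
def one_test(records, snapshot):
    return records == snapshot


records = [{"ids": {i}} for i in range(100)]
snapshot = copy.deepcopy(records)

c = average_seconds(one_iteration, records)       # cost of one iteration
t = average_seconds(one_test, records, snapshot)  # cost of one test
```

Because `one_iteration` mutates its argument, the repeated calls time slightly different states; for a rough comparison of c against t that is usually acceptable.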

Although I'll stick to standard library tools, thank you for the great insight about optimal testing frequency. I'll profile my functions and take this path. - jozmos

I think your only choices are:

  1. Have every update mark a "dirty flag" when it alters a value from its starting state.

  2. Doing a whole structure analysis (like the pickle/md5 combination you suggested).

  3. Just run a fixed number of iterations known to reach a steady state (possibly running too many times but not having the overhead of checking the termination condition).

Option 1 is analogous to what Python itself does with ref-counting. Option 2 is analogous to what Python does with its garbage collector. Option 3 is common in numerical analysis (e.g. run divide-and-average 20 times to compute a square root).

answered Nov 8, 11:18
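Option 2 can be sketched as a fingerprint of the whole structure. This sketch assumes the layout described in the question's comments (a list of dicts whose values are strings or sets of integers). It deliberately swaps `json` in for `pickle`: pickle does not promise identical bytes for equal structures, and two equal sets can iterate in different orders, so the sets are sorted first. The function name `fingerprint` and the sample keys are made up for illustration.

```python
import hashlib
import json


def fingerprint(records):
    """md5 hex digest of a list of dicts whose values are strings or sets of ints."""
    canonical = [
        {
            # Sets become sorted lists so that equal sets always serialize identically.
            key: sorted(value) if isinstance(value, (set, frozenset)) else value
            for key, value in record.items()
        }
        for record in records
    ]
    # sort_keys makes the text independent of dict insertion order.
    text = json.dumps(canonical, sort_keys=True)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


a = [{"name": "x", "ids": {0, 8}}]
b = [{"ids": {8, 0}, "name": "x"}]
same = fingerprint(a) == fingerprint(b)  # equal structures, equal digests
```

Storing only the digest avoids keeping a full copy of the previous state, at the price of walking the whole structure on every test.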

So it seems I haven't missed anything obvious. I've chosen to fine-tune what I've started out with, a hybrid of point 2 and 3 with @nightcracker's parameters. - jozmos

Checking for equality doesn't seem to me the right way to go. Provided that you have full control over the operations you perform, I would introduce a "modified" flag (a boolean variable) that is set to false at the beginning of each iteration. Whenever one of your operations modifies (part of) your data structure, the flag is set to true, and the loop repeats until the flag has remained false throughout a complete iteration.

answered Nov 8, 11:18
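A minimal sketch of such a flag, assuming the only mutation is `set.update`, which can only grow a set, so comparing sizes before and after is enough to detect a change. The helper names and the `"ids"` key are hypothetical, chosen for illustration only.

```python
def update_and_report(target, other):
    """Add the items of `other` to `target`; return True if `target` grew."""
    size_before = len(target)
    target.update(other)
    return len(target) != size_before


def absorb_right_neighbour(records):
    """One full pass; returns True if any set was modified during the pass."""
    modified = False  # reset at the beginning of each iteration
    for prev, nxt in zip(records, records[1:]):
        if update_and_report(prev["ids"], nxt["ids"]):
            modified = True  # raised on any change, never lowered mid-pass
    return modified


def run_until_stable(records, step):
    """Call step(records) until a complete pass reports no modification."""
    passes = 0
    while True:
        passes += 1
        if not step(records):
            return passes


records = [{"ids": {1}}, {"ids": {2}}, {"ids": {3}}]
passes = run_until_stable(records, absorb_right_neighbour)
```

With this sample data the loop stops after three passes, the last of which changes nothing. The size comparison is only valid for operations that can only grow or only shrink a set; an operation that can add and remove elements at once would need a different check.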

What about this? state = 5; while True: state += 1; state *= 0.9. This balances out at 9 but changes the state at every iteration. - orlp

You could create an .isModified() method for the object. - Blender

Of course this relies on you being able to check whether there in fact WAS a modification or not. The same holds for adding values to a set which may already be present, which is not always a modification; however, checking this can often be done quite cheaply, e.g. by comparing the respective sizes. Whether or not this is applicable here in fact requires more details on the actual data structure. - miserable

The problem is I don't know whether, say, prev['abc'].update(next['def']) has changed my data or not. It would be way harder to track these intermediate changes than even pickling the final result every iteration. - jozmos

You don't track changes, you just update the flag to True in the function doing the updating. - phkahler

I would trust the Python equality operator to be reasonably efficient for comparing compositions of built-in objects. I expect it would be faster than pickling plus hashing, provided Python tests list equality something like this:

def __eq__(a,b):
    if type(a) == list and type(b) == list:
        if len(a) != len(b):
            return False
        for i in range(len(a)):
            if a[i] != b[i]:
                return False
        return True
    #testing for other types goes here

Since the function returns as soon as it finds two elements that don't match, in the average case it won't need to iterate through the whole thing. Compare to hashing, which does need to iterate through the whole data structure, even in the best case.

Here is how I would do it:

import copy

def perform_a_bunch_of_operations(data):
    #take care to not modify the original data, as we will be using it later
    my_shiny_new_data = copy.deepcopy(data)
    #do lots of math here...
    return my_shiny_new_data

data = get_initial_data()
while True:
    nextData = perform_a_bunch_of_operations(data)
    if data == nextData: #steady state reached
        break
    data = nextData

This has the disadvantage of having to make a deep copy of your data each iteration, but it may still be faster than hashing - you can only know for sure by profiling your particular case.
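If the deep copy is too costly to make on every iteration, one variation is to test only every N iterations: snapshot the state, run a single step, and compare. This is a sketch under assumptions, not the code from the question: it supposes the step mutates the data in place, and `run_until_steady`, `check_every`, `max_iterations` and the sample step are names invented here.

```python
import copy


def run_until_steady(data, step_in_place, check_every=20, max_iterations=10_000):
    """Apply step_in_place(data) repeatedly; on every `check_every`-th iteration,
    compare the state before and after that single step.
    Returns the number of iterations that were run."""
    for iteration in range(1, max_iterations + 1):
        if iteration % check_every == 0:
            snapshot = copy.deepcopy(data)  # the copy is paid only on test iterations
            step_in_place(data)
            if data == snapshot:  # one step changed nothing: steady state
                return iteration
        else:
            step_in_place(data)
    raise RuntimeError("no steady state within max_iterations")


# Hypothetical step: every set absorbs the set of its right-hand neighbour.
def absorb_right_neighbour(records):
    for prev, nxt in zip(records, records[1:]):
        prev["ids"] |= nxt["ids"]


records = [{"ids": {1}}, {"ids": {2}}, {"ids": {3}}]
iterations = run_until_steady(records, absorb_right_neighbour, check_every=5)
```

Comparing the states on either side of one step, rather than states N iterations apart, means a computation that cycles with a period dividing N is not mistaken for a steady state. The cost is up to N - 1 wasted iterations after the steady state is actually reached.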

answered Nov 8, 11:18

You're right. I don't really need to hash my object as it's not that big. A deepcopy & compare every N iterations might suffice. - jozmos
