While waiting for a Python Meetup to start1, I decided to start writing a Python version of the duplicate-files script I presented previously. These are my thoughts on the experience of cutting a whole 68-line script.
Follow along in another window as I comment on
How hard was it?
Honestly, the second hardest part was looking up how the builtin objects
and libraries work. I kept flipping between the docs and my source code,
multiple times, to figure out how to get the entries from a
how to iterate over an Array List, and so forth.
The hardest, however, came when I discovered Python’s strict rules about
mutable objects and hash keys.
Because dictionaries, lists, and sets are mutable, attempts to use them
as keys always fail. (Mutable objects can change, changing the hash
Furthermore, Ruby’s YAML and JSON modules accept any iterable object as a
JSON Array or YAML List, while Python’s
json accepted only Lists.3
It also balked at Path objects, which is why line 28 uses strings.3
(And Python has no YAML extension in its standard library.
I could have installed
pyyaml, but I wanted to see what Python
could do out of the box.)
I ran into the same problems turning duplicate file sets into unique and sorted lists; the Ruby code took only a couple of method calls, while my Python code uses a bunch of temporary objects. There’s probably a more elegant solution with list comprehensions, but it was late and I wanted something working first.
Otherwise, the whole thing came together pretty easily.
Does it work?
Here’s a quick test run on them both, using my
– all my little coding projects – as sample data.
$ time duplicate-files.py Projects > Projects/dupes-py.json real 0m4.032s user 0m2.434s sys 0m1.521s
$ time duplicate-files.rb -jp -o Projects/dupes-rb.json Projects Looking for files in ["Projects"]: - done! Found 15150 files in 3123 size groups. Comparing files ... 26593/6246 |########################################|... done! Wrote performance data to /tmp/dfperf20230303-60194-gfcr6l/perf-2023-03-03-124130-960648980.json real 0m5.493s user 0m3.527s sys 0m1.938s
The Python version is about a second faster, but it’s doing way less.4
The Python version also found two more duplicate file sets than the Ruby
version. One of the file sets contained only a null byte (00),
and the other only a newline (0A).
FileUtils.cmp tries to compare files by lines,
filecmp.cmp does a straight binary comparison.
Honestly I thought this might be the hardest part without Ruby’s
module, but a simple recursive algorithm suits the problem much better.
As noted above, not keeping the Path objects annoyed me a bit.
It worked out, but I wish Python’s
json used duck-typing more when
converting a structure to JSON.3
And yes, I couldn’t help putting in a comment marking the function’s end. Too much Ruby and Lua, I guess.
This has two fewer clauses than the Ruby equivalent. I don’t know whether checking that the files exist before comparing them, or comparing paths simply by string name rather than the paths they point to, will hurt me when I start trying to break the script.
Also, I see that my C-instilled habit of putting procedures above the places that use them remains intact …
In the Ruby version this required a few calls in
Because Python lacks the “initializer blocks” the Ruby version uses,
though, I felt I had to make it its own function.
This is a mess. I won’t sugar-coat it. I needed unique sorted lists of all the file sets I created. Briefly I considered making them the same Set instance, but I was worried I’d end up eradicating one or more of the Sets if I got fancy.2 So instead I ended up creating multiple sets, then make the file sets into sorted lists, then made the lists into tuples so I could check for equality, then a set of the tuples to make them unique, then turned the set of unique tuples back into a list so the JSON emitter wouldn’t complain.
This is where the magic happens, and by magic I mean brute-force combinatorics. Like the Ruby version, I compare every file of the same size against every other file in the same size set, unless the size is 0 (which would always succeed).
Both the Ruby and Python versions are unable to infer that if A is a duplicate of B, and A a duplicate of C, than B is a duplicate of C without doing another comparison. I don’t know if that would speed things up or just complicate the code.2
(lines 59-65, 67-68)
Essentially this is the Ruby main function without the option parsing … or user-friendly output … or performance logging … or vertical whitespace. In other words, what the first cut of the Ruby program looked like in the beginning. I think. It was so long ago.
The use of
Path here rather than up in
find_files … tasks me.
It tasks me and I shall fix it. Maybe with a short procedure that kicks
off the recursive calls.
Finally I call the main procedure if the user invoked this script as a program. Because they might mistake it for a module …
Conclusion and Next Steps
This little script is no Zope. It doesn’t even do what my crufty Ruby script can do. It’s just a quick project to get back into Python.
Still, improving it will be my next installment:
Parsing command-line options with
argparseor something else.
Generating (or not generating) messages to let the user know what the script is doing. (Spinning wheels and progress bars optional.)
Write results to a file instead of STDOUT.
pyyamlto generate (pretty-printed) YAML or JSON as desired.
An option to include zero-length files as “duplicates”.
(Maybe) compare all files to those in a “canonical directory”.
sort_uniqto create unique, sorted lists inside a larger list more efficiently and elegantly.
(Maybe) figure out why I can’t accurately estimate how many comparisons a data set will require, thus removing the need for “performace” (sic) data.
(Maybe) deleting duplicates with user approval, thus removing the need for a
Originally in person, but because of bad weather yesterday we held it on Zoom. ↩︎
frozensetinstance for every set of identical files, built by combining sets from each file and setting the result back as the set for each contained file in the
dupsetsindex, would solve the hashing, redundant comparisons, and uniquing problems. a) The sets would already be unique. b) Before comparing files the function could check if the sets for each file already contain the other. c) If all the sets with the same data are the same
frozenset, I can just check if that instance is already in a set, then convert them all to lists and sort them at the same time. I’ll leave that for next version. ↩︎ ↩︎ ↩︎
After writing this, I discovered one can substitute an encoder object into
json.dumps()that understands sets, Paths, or anything else. I’m letting this sentence stand, along with the horrible code I wrote as a result of not knowing this. I’ll fix it next version. ↩︎ ↩︎ ↩︎
Also note the Progress bar is still broken. “26593/6246”? Really? ↩︎
I wanted a step to examine the duplicate files and decide which versions or locations to keep. But maybe there’s another way … ↩︎