Relearning Python #2: duplicate-files.py

In the last post, we wrote a rough cut of a Python script to list duplicated files in one or more directories. In this post we add command-line options, output options, and some internal improvements.

Improvements

Last time I said I’d do the following:

Parsing command-line options with argparse or something else.

[…]

Write results to a file instead of STDOUT.

Installing pyyaml to generate (pretty-printed) YAML or JSON as desired.

An option to include zero-length files as “duplicates”.

[…]

(Maybe) reworking sort_uniq to create unique, sorted lists inside a larger list more efficiently and elegantly.

Take a look at the latest duplicate-files.py. This script now works almost exactly like the Ruby version.

$ time duplicate-files.py -q Projects -o Projects/dupes-py.yaml

real	0m5.971s
user	0m4.272s
sys 	0m1.675s

$ time duplicate-files.rb -q Projects -o Projects/dupes-rb.yaml

real	0m7.724s
user	0m4.753s
sys 	0m2.896s

$ diff Projects/dupes-{rb,py}.yaml
154a155,156
> - - Projects/3pty/jsonp-api/tck/tck-tests/src/main/resources/jsonObjectUnknownEncoding.json
>   - Projects/3pty/jsonp-api/tck/tck-tests/target/classes/jsonObjectUnknownEncoding.json
4089a4092,4093
> - - Projects/vendor/java/antlr/4.7.2/antlr-python2-runtime-4.7.2/src/antlr4_python2_runtime.egg-info/dependency_links.txt
>   - Projects/vendor/java/antlr/4.7.2/antlr-python3-runtime-4.7.2/src/antlr4_python3_runtime.egg-info/dependency_links.txt

Important Changes In Detail

Please follow along in the Python code.

`add_to_dupsets`

(ll 66-73)

As I suggested last time, I’ve rewritten this function to use frozensets, which are hashable and so, unlike ordinary sets, can be elements of another set. Also, in the dupset mapping from file names to known duplicates, all keys that are duplicates of one another share the same frozenset.
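To make the shared-frozenset idea concrete, here is a minimal sketch. It is not the code from the script: the function signature and the variable names are assumptions made for illustration.

```python
from pathlib import Path


def add_to_dupsets(dupsets, file1, file2):
    """Record that file1 and file2 are duplicates (hypothetical signature)."""
    # Merge the two files with whatever sets they already belong to.
    merged = (
        frozenset({file1, file2})
        | dupsets.get(file1, frozenset())
        | dupsets.get(file2, frozenset())
    )
    # Point every member at the same frozenset object.
    for member in merged:
        dupsets[member] = merged


dupsets = {}
a, b, c = Path("a.txt"), Path("b.txt"), Path("c.txt")
add_to_dupsets(dupsets, a, b)
add_to_dupsets(dupsets, b, c)

# Because frozensets are hashable, the distinct groups can be collected in a set.
unique_sets = set(dupsets.values())
```

After the two calls, all three keys map to one frozenset containing all three paths, so collecting the values into a set leaves a single group.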

`compare_files`

(ll 75-86)

I’ve made this function more “Pythonic” by using itertools.combinations instead of nested loops. (The Ruby version now uses something similar.) I also inlined sort_uniq because using frozensets really did make the code simpler. I could have inlined the variable superset, but I wanted to inspect the sets I was creating along the way.
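As a rough illustration of replacing nested loops with itertools.combinations, the sketch below compares every pair of files once. The function name find_duplicate_sets, the size check, and the use of filecmp.cmp are assumptions of this example; the script itself may compare files differently.

```python
import filecmp
import itertools
import tempfile
from pathlib import Path


def find_duplicate_sets(paths):
    """Return a set of frozensets, each holding paths with identical content."""
    dupsets = {}
    # combinations(paths, 2) yields each unordered pair exactly once.
    for first, second in itertools.combinations(paths, 2):
        # Files of different sizes cannot be duplicates; skip the full read.
        if first.stat().st_size != second.stat().st_size:
            continue
        if filecmp.cmp(first, second, shallow=False):
            merged = (
                frozenset({first, second})
                | dupsets.get(first, frozenset())
                | dupsets.get(second, frozenset())
            )
            for member in merged:
                dupsets[member] = merged
    return set(dupsets.values())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name, text in [("a.txt", "same"), ("b.txt", "same"), ("c.txt", "different")]:
        (root / name).write_text(text)
    result = find_duplicate_sets(sorted(root.iterdir()))
    names = {frozenset(p.name for p in group) for group in result}
```

Here a.txt and b.txt end up in one group and c.txt is left out because nothing matches it.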

`convert_results`

(ll 134, 135)

This snippet of code, which turns nested Paths and sets into strings and lists, kept moving around. I made it its own function so I could call it as late in processing as possible, because sets and Paths are the best data structures for the rest of this script.
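A conversion of this kind could look like the following sketch. The sorting is an assumption added here to make the output deterministic; the real function may order things differently.

```python
from pathlib import Path


def convert_results(results):
    """Turn a collection of sets of Paths into sorted lists of strings."""
    return sorted(sorted(str(path) for path in group) for group in results)


results = {
    frozenset({Path("b.txt"), Path("a.txt")}),
    frozenset({Path("d.txt"), Path("c.txt")}),
}
plain = convert_results(results)
```

The result contains only lists and strings, which is what a YAML or JSON serializer can handle without extra help.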

`make_argparser`

(ll 88-132)

This is just a big procedure to define command line arguments. Python’s argparse is much more verbose than Ruby’s OptionParser.
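For readers who have not used argparse, here is a cut-down sketch covering only the options that appear in this post (-q, -v, -z, -o, and the directory arguments). The long option names and help strings are guesses for illustration, not the script's actual definitions.

```python
import argparse
from pathlib import Path


def make_argparser():
    """Build a parser for a subset of the options (long names are assumed)."""
    parser = argparse.ArgumentParser(
        description="List duplicated files in one or more directories."
    )
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="suppress progress messages")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print extra messages")
    parser.add_argument("-z", "--zero-length", action="store_true",
                        help="report zero-length files as duplicates")
    parser.add_argument("-o", "--output", type=Path, default=None,
                        help="write results to this file instead of STDOUT")
    parser.add_argument("directories", nargs="+", type=Path,
                        help="directories to search")
    return parser


# Parse the same arguments used in the timing run above.
args = make_argparser().parse_args(
    ["-q", "Projects", "-o", "Projects/dupes-py.yaml"]
)
```

Even this reduced version takes a dozen or so lines for five arguments, which gives a feel for the verbosity.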

`run`

The main function has a few big changes:

Using the argument parser to parse options and the regular directory arguments.
Implementing the -z option by post-processing the results of compare_files.
Adding YAML as an output option … assuming the import doesn’t fail.
Adding “pretty-printing” to both YAML and JSON. The “pretty” YAML isn’t quite the same as in the Ruby script, but honestly it wasn’t that pretty in that script either.
Writing to an output file if the user specifies that option.

Also worth noting: just as PathEncoder strips out the Path objects for JSON, convert_results strips out Path objects and sets for YAML.
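The output side of these changes can be sketched as below. PathEncoder is the name used in the script, but its body here is a guess; format_results, write_results, and their parameters are hypothetical names for this example. The YAML branch assumes the results have already been converted to plain lists and strings.

```python
import json
import sys
from pathlib import Path

try:
    import yaml  # provided by the pyyaml package, if installed
except ImportError:
    yaml = None  # YAML output is unavailable; fall back to JSON


class PathEncoder(json.JSONEncoder):
    """JSON encoder that writes Path objects as strings."""

    def default(self, o):
        if isinstance(o, Path):
            return str(o)
        return super().default(o)


def format_results(results, use_yaml=False, pretty=False):
    """Serialize results as YAML when requested and available, else as JSON."""
    if use_yaml and yaml is not None:
        # Block style is the more readable ("pretty") YAML layout.
        return yaml.safe_dump(results, default_flow_style=not pretty)
    return json.dumps(results, cls=PathEncoder, indent=2 if pretty else None)


def write_results(text, output=None):
    """Write to the given file, or to STDOUT when no file is specified."""
    if output is None:
        sys.stdout.write(text + "\n")
    else:
        Path(output).write_text(text + "\n")


text = format_results([[Path("a.txt"), Path("b.txt")]])
pretty_text = format_results([[Path("a.txt"), Path("b.txt")]], pretty=True)
write_results(text)
```

Guarding the import this way keeps JSON output working on machines where pyyaml has not been installed.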

More Improvements

I also said I’d do three other things:

Generating (or not generating) messages to let the user know what the script is doing. (Spinning wheels and progress bars optional.)

Right now -q suppresses no output and -v adds no output because there is none. Maybe because it’s my second time doing this I felt no need to print messages, debugging or otherwise.

(Maybe) compare all files to those in a “canonical directory”.

I’m going to deprecate -d. One might as well add the canonical directory to the regular directory list: the script recurses into that directory either way, and I doubt that skipping comparisons that don’t involve it saves much time. The only other effect of -d is to put the file from that directory first in each list of duplicates, which matters because remove-files.rb removes all but the first file in a set of duplicates. (That’s also why -z includes a blank line at the start of the zero-length files.)

(Maybe) figure out why I can’t accurately estimate how many comparisons a data set will require, thus removing the need for “performace” (sic) data.

Instead I removed the Progress class in duplicate-files.rb. I can figure out why I couldn’t predict the number of comparisons later, as a different project.

Vale, `duplicate-files.*`

Despite still having more to do, I’m going to put these scripts aside for now. Instead, next time I’ll write something else in Python: a processor for Fountain, the screenplay equivalent of Markdown.
