In the last post, we wrote a rough cut of a Python script to list duplicated files in one or more directories. In this post we have added command-line options, output options, and some internal improvements.
Improvements
Last time I said I’d do the following:
- Parsing command-line options with argparse or something else.
[…]
- Write results to a file instead of STDOUT.
- Installing pyyaml to generate (pretty-printed) YAML or JSON as desired.
- An option to include zero-length files as “duplicates”.
[…]
- (Maybe) reworking sort_uniq to create unique, sorted lists inside a larger list more efficiently and elegantly.
Take a look at the latest duplicate-files.py.
This script now works almost exactly like the Ruby version.
$ time duplicate-files.py -q Projects -o Projects/dupes-py.yaml
real 0m5.971s
user 0m4.272s
sys 0m1.675s
$ time duplicate-files.rb -q Projects -o Projects/dupes-rb.yaml
real 0m7.724s
user 0m4.753s
sys 0m2.896s
$ diff Projects/dupes-{rb,py}.yaml
154a155,156
> - - Projects/3pty/jsonp-api/tck/tck-tests/src/main/resources/jsonObjectUnknownEncoding.json
> - Projects/3pty/jsonp-api/tck/tck-tests/target/classes/jsonObjectUnknownEncoding.json
4089a4092,4093
> - - Projects/vendor/java/antlr/4.7.2/antlr-python2-runtime-4.7.2/src/antlr4_python2_runtime.egg-info/dependency_links.txt
> - Projects/vendor/java/antlr/4.7.2/antlr-python3-runtime-4.7.2/src/antlr4_python3_runtime.egg-info/dependency_links.txt
Important Changes In Detail
Please follow along in the Python code.
add_to_dupsets (ll 66-73)
As I suggested last time, I’ve rewritten this function to use frozensets, which Python will accept as elements in a set. Also, all keys in the dupset mapping (from file names to known duplicates) share the same frozenset if they’re duplicates.
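A minimal sketch of the idea, not the code in duplicate-files.py: the signature, the merge logic, and the sample file names are assumptions for illustration. Each file name maps to a frozenset of its known duplicates, and every member of a group points at the same frozenset object.

```python
def add_to_dupsets(dupsets, file1, file2):
    """Record that file1 and file2 are duplicates (hypothetical signature)."""
    # Merge the pair with whatever groups either file already belongs to.
    merged = (
        frozenset((file1, file2))
        | dupsets.get(file1, frozenset())
        | dupsets.get(file2, frozenset())
    )
    # Point every member of the group at the same frozenset object.
    for name in merged:
        dupsets[name] = merged


dupsets = {}
add_to_dupsets(dupsets, "a.txt", "b.txt")
add_to_dupsets(dupsets, "b.txt", "c.txt")

# frozensets are hashable, so the distinct groups can be collected in a set.
groups = set(dupsets.values())
```

Because a frozenset is immutable and hashable, collecting the mapping's values in a set collapses the repeated references into one entry per group of duplicates.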
compare_files (ll 75-86)
I’ve made this function more “Pythonic” by using itertools.combinations instead of nested loops. (The Ruby version now uses something similar.) I also inlined sort_uniq because using frozensets really did make the code simpler. I could have inlined the variable superset, but I wanted to inspect the sets I was creating along the way.
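Roughly, the pairwise comparison could look like the sketch below. This is an illustration under assumptions, not the script's code: it assumes the candidates are already worth comparing (for example, files of the same size) and uses the standard library's filecmp.cmp for the byte-by-byte check, which may differ from what duplicate-files.py does.

```python
import filecmp
from itertools import combinations


def compare_files(candidates):
    """Return a set of frozensets, each a group of byte-identical files.

    `candidates` is assumed to be a list of paths that passed some cheap
    pre-filter such as equal file size (hypothetical).
    """
    dupsets = {}
    # combinations() yields each unordered pair exactly once,
    # replacing the usual nested index loops.
    for file1, file2 in combinations(candidates, 2):
        if filecmp.cmp(file1, file2, shallow=False):
            # Kept as a named variable so it is easy to inspect.
            superset = (
                frozenset((file1, file2))
                | dupsets.get(file1, frozenset())
                | dupsets.get(file2, frozenset())
            )
            for name in superset:
                dupsets[name] = superset
    return set(dupsets.values())
```

For n candidates this makes n(n-1)/2 comparisons at most, the same as the nested loops it replaces; the gain is readability, not speed.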
convert_results (ll 134, 135)
This snippet of code to turn nested Paths and sets into strings and lists kept moving around. I made it its own function so I could deploy it as late in processing as possible, because sets and Paths are the best data structures for this script.
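A conversion like this fits in a line or two. The version below is a guess at the shape, not the script's code; in particular, sorting both levels is an assumption made here so that the output is stable from run to run.

```python
from pathlib import Path


def convert_results(results):
    """Turn an iterable of sets of Paths into a list of lists of strings."""
    # Sorting is an assumption: it gives a deterministic order for output.
    return sorted(sorted(str(path) for path in group) for group in results)


example = {frozenset({Path("b.txt"), Path("a.txt")})}
converted = convert_results(example)
```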
make_argparser (ll 88-132)
This is just a big procedure to define command line arguments. Python’s argparse is much more verbose than Ruby’s OptionParser.
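To give a sense of the verbosity, here is a cut-down sketch covering only the short options mentioned in this post. The long option names, help strings, and destination names are hypothetical; the real make_argparser defines more options than this.

```python
import argparse


def make_argparser():
    """Build a parser for a few of the options (long names are assumptions)."""
    parser = argparse.ArgumentParser(
        description="List duplicate files in one or more directories."
    )
    parser.add_argument("directories", nargs="+",
                        help="directories to search")
    parser.add_argument("-o", "--output", metavar="FILE",
                        help="write results to FILE instead of standard output")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="suppress messages")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print extra messages")
    parser.add_argument("-z", "--zero-length", action="store_true",
                        dest="zero_length",
                        help="report zero-length files as duplicates")
    return parser


# Parse the same arguments used in the timing run above.
args = make_argparser().parse_args(
    ["-q", "Projects", "-o", "Projects/dupes-py.yaml"]
)
```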
run
The main function has a few big changes:
- Using the argument parser to parse options and the regular directory arguments.
- Implementing the -z option by post-processing the results of compare_files.
- Adding YAML as an output option … assuming the import doesn’t fail.
- Adding “pretty-printing” to both YAML and JSON. The “pretty” YAML isn’t quite the same as in the Ruby script, but honestly it wasn’t that pretty in that script either.
- Writing to an output file if the user specifies that option.
Also worth noting: just as PathEncoder strips out the Path objects for JSON, convert_results strips out Path objects and sets for YAML.
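The output side might be sketched as follows. Only the name PathEncoder comes from the script; format_results, write_results, their parameters, and the way “pretty” maps onto the JSON and YAML options are assumptions for illustration. The YAML branch needs the pyyaml package and is skipped gracefully if the import fails.

```python
import json
import sys
from pathlib import Path

try:
    import yaml  # provided by the pyyaml package
except ImportError:
    yaml = None  # YAML output is simply unavailable


class PathEncoder(json.JSONEncoder):
    """Encode Path objects as plain strings for JSON."""

    def default(self, o):
        if isinstance(o, Path):
            return str(o)
        return super().default(o)


def format_results(results, fmt="json", pretty=False):
    """Render a list of lists of Paths as JSON or YAML text (hypothetical)."""
    if fmt == "yaml":
        if yaml is None:
            raise RuntimeError("YAML output requires the pyyaml package")
        # YAML gets plain strings and lists, never Paths or sets.
        plain = [[str(path) for path in group] for group in results]
        return yaml.safe_dump(plain, default_flow_style=not pretty)
    indent = 2 if pretty else None
    return json.dumps(results, cls=PathEncoder, indent=indent) + "\n"


def write_results(text, output=None):
    """Write to the named file if given, otherwise to standard output."""
    if output:
        with open(output, "w") as out:
            out.write(text)
    else:
        sys.stdout.write(text)
```

Keeping the formatting separate from the writing means the same text can go to a file or to standard output without either branch knowing about the other.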
More Improvements
I also said I’d do three other things:
- Generating (or not generating) messages to let the user know what the script is doing. (Spinning wheels and progress bars optional.)
Right now -q suppresses no output and -v adds no output because there is none. Maybe because it’s my second time doing this I felt no need to print messages, debugging or otherwise.
- (Maybe) compare all files to those in a “canonical directory”.
I’m going to deprecate -d. One might as well add it to the directory list, since you’re recursing down that directory either way and I doubt it saves time not checking for duplicates that don’t involve that directory. The only other effect is to put the file from that directory first in the list of duplicates, since remove-files.rb removes all but the first in a set of duplicates. (That’s why -z includes a blank line at the start of the zero-length files.)
- (Maybe) figure out why I can’t accurately estimate how many comparisons a data set will require, thus removing the need for “performace” (sic) data.
Instead I removed the Progress class in duplicate-files.rb. I can figure out why I couldn’t predict the number of comparisons later, as a different project.
Vale, duplicate-files.*
Despite still having more to do, I’m going to put these scripts aside for now. Instead, next time I’ll write something else in Python: a processor for the screenplay equivalent of Markdown, Fountain.