This script recursively walks a directory tree from a starting directory and counts, for each subdirectory, the number of files that match a given regular expression.
The matches found are printed to the screen when the script is done running.
argparse is used to parse arguments to the script. Arguments are accepted to specify the directory where scanning will start, as well as the regular expression that must match in the files.
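As an illustration of the argument handling described above, here is a minimal sketch. The argument names (root, pattern) and the long option names are hypothetical; only the -v and -g flags appear in the test commands in this document, and their meanings here are assumed rather than taken from the script.

```python
from __future__ import print_function

import argparse


def build_parser():
    """Build a parser with two positional arguments and two optional flags."""
    parser = argparse.ArgumentParser(
        description="Count files that match a regular expression, per directory.")
    # Positional arguments; the names "root" and "pattern" are illustrative.
    parser.add_argument("root", help="directory where scanning starts")
    parser.add_argument("pattern", help="regular expression to look for in files")
    # -v and -g are used in the test commands; their meaning is assumed here.
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print extra detail (assumed meaning)")
    parser.add_argument("-g", "--graph", action="store_true",
                        help="plot matches per directory (assumed meaning)")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["testdir", "CS[0-9]", "-v"])
print(args.root, args.pattern, args.verbose, args.graph)
```

With a parser like this, argparse itself prints the usage message and an error when too few arguments are given.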
The code was tested in Ubuntu 16.04 with Python 2.7.12, and in Windows 10 with Python 2.7.15. The code does not make any assumption about the operating system. When joining file paths, for example, os.path.join is used instead of manually concatenating strings.
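The portable path handling mentioned above can be sketched as follows. The helper name iter_file_paths is hypothetical, and the use of os.walk is an assumption about how the directory tree might be traversed, not a statement of what the script does.

```python
from __future__ import print_function

import os
import tempfile


def iter_file_paths(root):
    """Yield the full path of every file under root, built with os.path.join."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # os.path.join picks the right separator for the current OS.
            yield os.path.join(dirpath, name)


# Build a tiny throwaway tree to demonstrate.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "dir1"))
with open(os.path.join(root, "dir1", "file1.txt"), "w") as handle:
    handle.write("CS12345\n")

paths = list(iter_file_paths(root))
print(paths)
```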
The code is designed to be scalable by using worker processes to perform the search of a regular expression in a given file. The main process spawns a pool of workers. The main process is responsible for looking for files to scan (scanning the directory structure), and providing the full path to the files to scan to the worker processes. After walking through the existing directories, the main process is then responsible for gathering the results of whether a file matches the given regular expression or not, and collecting them in a dictionary.
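The division of work described above could look roughly like the sketch below. It is not the script's actual code: the function names file_matches and count_matches, the pool size, and the choice to tally matches per directory in the dictionary are all assumptions made for illustration.

```python
from __future__ import print_function

import multiprocessing
import os
import re
import tempfile


def file_matches(task):
    """Worker: return (path, matched) for one file; unreadable files count as no match."""
    path, pattern = task
    regex = re.compile(pattern)
    try:
        with open(path, "r") as handle:
            # Read line by line so a large file is never loaded whole.
            return path, any(regex.search(line) for line in handle)
    except (IOError, OSError):
        return path, False


def count_matches(root, pattern, processes=2):
    """Main process: walk root, hand paths to the pool, tally matches per directory."""
    tasks = [(os.path.join(dirpath, name), pattern)
             for dirpath, _, names in os.walk(root) for name in names]
    pool = multiprocessing.Pool(processes)
    try:
        results = pool.map(file_matches, tasks)
    finally:
        pool.close()
        pool.join()
    counts = {}
    for path, matched in results:
        if matched:
            directory = os.path.dirname(path)
            counts[directory] = counts.get(directory, 0) + 1
    return counts


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    with open(os.path.join(root, "file1.txt"), "w") as handle:
        handle.write("CS12345\n")
    with open(os.path.join(root, "file2.txt"), "w") as handle:
        handle.write("nothing here\n")
    print(count_matches(root, "[Cc][Ss][0-9]"))
```

The worker takes a single tuple argument so that it can be passed to pool.map, and it is defined at module level so that it can be pickled and sent to the worker processes.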
The code is designed to handle errors, and continue working if possible. The parameters are validated at the beginning, to ensure they make sense. Places where things can fail (reading a file, scanning a directory, waiting for results of worker processes) are protected by handling the possible errors that might happen.
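The up-front parameter validation could be sketched like this. The function name validate_arguments and the exact message wording are hypothetical; the three checks mirror the error cases listed in the tests (non-existing directory, root path that is a file, invalid regular expression).

```python
from __future__ import print_function

import os
import re


def validate_arguments(root, pattern):
    """Return a list of error messages; an empty list means the arguments look usable."""
    errors = []
    if not os.path.exists(root):
        errors.append("directory does not exist: %s" % root)
    elif not os.path.isdir(root):
        errors.append("root path is not a directory: %s" % root)
    try:
        re.compile(pattern)
    except re.error as exc:
        errors.append("regular expression is not valid: %s" % exc)
    return errors


# The current directory exists, but "[" is an unterminated character set.
print(validate_arguments(os.getcwd(), "["))
```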
This section defines the tests that should be performed for the script.
- Regular expressions work as expected
- Add 3 files to a directory, file1.txt, file2.txt, file3.txt.
- file1.txt contains the line:
CS12345
- file3.txt contains the line:
cs99999
- Run script like so:
python findfiles.py "directory" "CS[0-9]" -v
- Output should show that match was found in file1.txt, only 1 match in directory.
- Run script like so:
python findfiles.py "directory" "cs[0-9]" -v
- Output should show that match was found in file3.txt, only 1 match in directory.
- Run script like so:
python findfiles.py "directory" "[Cc][Ss][0-9]" -v
- Output should show that match was found in file1.txt and file3.txt, 2 matches in directory.
- Recursion works as expected
- Create the following directory structure under a test directory:
- dir1
- dir2
- dir4
- dir5
- dir3
- dir6
- dir7
- dir8
- dir6
- Add the following files:
- file1.txt under dir1, contains line:
[email protected]
- file2.txt under dir4
- file3.txt under dir5, contains line:
[email protected]
- file4.txt under dir6
- file5.txt under dir6
- file6.txt under dir8, contains line:
[email protected]
- file7.txt under dir8
- file8.txt under dir8, contains line:
%person%[email protected]
- Run script like so:
python findfiles.py "directory" "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}" -v
- Output should show that match was found in file1.txt, file6.txt and file8.txt
- Script still works when graphing package is not present
- Make sure the Python environment does not contain the package matplotlib
- Run script like so:
python findfiles.py "directory" "[Cc][Ss][0-9]" -g
- The directory should exist; it does not matter whether it contains matches.
- Output should show matches found, and a message saying that the package matplotlib is not installed should appear, as well as instructions to install it.
- Graphing functionality works as expected
- Use the same directory structure and files as in the test case "Recursion works as expected"
- Run script like so:
python findfiles.py "directory" "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}" -g
- Output should show 1 match in dir1 and 2 matches in dir8, and should generate a file called matchfig.png with a graph showing the matches per directory found.
- Script with no arguments shows usage
- Run script like so:
python findfiles.py
- Message showing how to use the script appears, and also tells user there are not enough arguments.
- Run script like so:
python findfiles.py "directory"
- Message showing how to use the script appears, and also tells user there are not enough arguments.
- Non-existing directory shows appropriate error
- Run script like so, making sure that the directory passed as argument does not exist:
python findfiles.py "non-existing-directory" "[Cc][Ss][0-9]"
- Message saying directory does not exist should appear.
- Passing an existing file as starting directory shows appropriate error
- Run script like so, making sure that file1.txt is an existing file:
python findfiles.py "file1.txt" "[Cc][Ss][0-9]"
- Message saying root path is not a directory should appear.
- Passing an invalid regular expression shows appropriate error
- Run script like so, making sure the directory passed as argument exists:
python findfiles.py "directory" "["
- Message saying regular expression is not valid should appear.
- Unreadable files are handled appropriately and do not prevent finding other matches
- Create a test directory and add the files, file1.txt, file2.txt, file3.txt
- file1.txt contains the line:
CS12345
- Change the permissions of file2.txt so that it is not readable by the current user
- On Linux, run:
chmod -r file2.txt
- In Windows, right-click on the file in File Explorer and select Properties. Select the Security tab in the dialog that appears. Select your user in the list and then click the Edit permissions button. In the permissions dialog select your user again and check the Read & execute and Read checkboxes under the Deny column. Click OK to dismiss both dialogs.
- Run script like so:
python findfiles.py "directory" "[Cc][Ss][0-9]"
- Message saying that file2.txt could not be read appears, match is still found in file1.txt
- Unreadable directories are handled appropriately and do not prevent finding other matches
- Create a test directory with the following structure:
- testdir
- subdir1
- subdir2
- Add the following files:
- file1.txt under subdir1, contains line:
CS12345
- file2.txt under subdir1
- file3.txt under subdir2, contains line:
cs99999
- Change permissions of testdir/subdir2 so that it is not readable by the current user
- On Linux, run:
chmod -r subdir2
- In Windows, right-click on subdir2 in File Explorer and select Properties. Select the Security tab in the dialog that appears. Select your user in the list and then click the Edit permissions button. In the permissions dialog select your user again and check the Read & execute, List folder contents and Read checkboxes under the Deny column. Click OK to dismiss both dialogs.
- Run script like so:
python findfiles.py "testdir" "[Cc][Ss][0-9]"
- Message saying that testdir/subdir2 could not be listed appears, match is still found in file1.txt
- Large files are handled appropriately
- Create a very large text file, on the order of 500 MB. Name it file1.txt and put it in a test directory. Make sure an email address appears as the last line in the file, and that the rest of the file does not contain an email address.
- Run script like so:
python findfiles.py "directory" "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
- Script should be able to find the match in the large text file.
- Binary files are handled appropriately
- Copy a binary file into a test directory
- Run script like so:
python findfiles.py "directory" "[Cc][Ss][0-9]"
- Depending on the binary file a match might or might not be found, but it should not affect finding matches in other files.
- If a worker process dies for any reason, the rest of the workers and results are not affected
- Create a very large text file, on the order of 500 MB. Name it file1.txt and put it in a test directory.
- In the same directory create a file named file2.txt that contains the line
CS12345
- Run script like so, making sure the large text file does not contain a match:
python findfiles.py "directory" "[Cc][Ss][0-9]"
- After some time (perhaps 20 seconds) kill the worker process that is reading the large text file
- On a Linux machine with 4 processor cores, for example, the command ps -a will list 5 instances of python being executed. Two of them should be running, while three should be marked as <defunct>. The first of the running processes is probably the main process, so you should kill the other running process.
- After some time a message should appear saying that the script was not able to get expected results from the worker process. The match in file2.txt should appear in the results.
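The expectations of the first test case ("Regular expressions work as expected") can be cross-checked independently of the script with a short sketch like the one below. The helper matching_files is hypothetical and only approximates the matching step; file2.txt is assumed to be empty, since the test case does not state its contents.

```python
from __future__ import print_function

import os
import re
import tempfile


def matching_files(directory, pattern):
    """Return the sorted names of files in directory with a line matching pattern."""
    regex = re.compile(pattern)
    found = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name)) as handle:
            if any(regex.search(line) for line in handle):
                found.append(name)
    return found


# Recreate the three files from the test case (file2.txt assumed empty).
directory = tempfile.mkdtemp()
contents = {"file1.txt": "CS12345\n", "file2.txt": "", "file3.txt": "cs99999\n"}
for name, text in contents.items():
    with open(os.path.join(directory, name), "w") as handle:
        handle.write(text)

for pattern in ("CS[0-9]", "cs[0-9]", "[Cc][Ss][0-9]"):
    print(pattern, matching_files(directory, pattern))
```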