CedarBackup3.filesystem
Provides filesystem-related objects.
Author: Kenneth J. Pronovici <pronovic@ieee.org>
Module Contents¶
- CedarBackup3.filesystem.logger¶
- class CedarBackup3.filesystem.FilesystemList¶
Bases:
list
Represents a list of filesystem items.
This is a generic class that represents a list of filesystem items. Callers can add individual files or directories to the list, or can recursively add the contents of a directory. The class also allows for up-front exclusions in several forms (all files, all directories, all items matching a pattern, all items whose basename matches a pattern, or all directories containing a specific “ignore file”). Symbolic links are typically backed up non-recursively, i.e. the link to a directory is backed up, but not the contents of that link (we don’t want to deal with recursive loops, etc.).
The custom methods such as addFile will only add items if they exist on the filesystem and do not match any exclusions that are already in place. However, since a FilesystemList is a subclass of Python’s standard list class, callers can also add items to the list in the usual way, using methods like append() or insert(). No validations apply to items added to the list in this way; however, many list-manipulation methods deal “gracefully” with items that don’t exist in the filesystem, often by ignoring them.
Once a list has been created, callers can remove individual items from the list using standard methods like pop() or remove(), or they can use custom methods to remove specific types of entries or entries which match a particular pattern.
Note: Regular expression patterns that apply to paths are assumed to be bounded at front and back by the beginning and end of the string, i.e. they are treated as if they begin with ^ and end with $. This is true whether we are matching a complete path or a basename.
- excludeFiles¶
- excludeDirs¶
- excludeLinks¶
- excludePaths¶
- excludePatterns¶
- excludeBasenamePatterns¶
- ignoreFile¶
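The anchoring rule for exclusion patterns can be illustrated with Python's re module. This is a minimal sketch, not the library's code: the helper name matches_exclusion and its basename_only flag are hypothetical, and it assumes that requiring a full-string match is an adequate stand-in for wrapping the pattern in ^ and $.

```python
import os
import re


def matches_exclusion(path, patterns, basename_only=False):
    """Return True if path (or its basename) fully matches any pattern.

    Hypothetical helper: re.fullmatch requires the entire string to match,
    which has much the same effect as bounding the pattern with ^ and $.
    """
    target = os.path.basename(path) if basename_only else path
    return any(re.fullmatch(pattern, target) for pattern in patterns)


# The whole path matches this pattern.
whole = matches_exclusion("/tmp/data/file.bak", [r".*\.bak"])
# Only a fragment of the path matches, so the bounded pattern fails.
fragment = matches_exclusion("/tmp/data/file.bak", [r"\.bak"])
# Matching against the basename alone.
base = matches_exclusion("/tmp/data/file.bak", [r"file\..*"], basename_only=True)
print(whole, fragment, base)
```

The practical consequence is that a pattern meant to match "anything ending in .bak" needs an explicit leading .* when it is used as an exclusion.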
- addFile(path)¶
Adds a file to the list.
The path must exist and must be a file or a link to an existing file. It will be added to the list subject to any exclusions that are in place.
- Parameters
path (String representing a path on disk) – File path to be added to the list
- Returns
Number of items added to the list
- Raises
ValueError – If path is not a file or does not exist
ValueError – If the path could not be encoded properly
- addDir(path)¶
Adds a directory to the list.
The path must exist and must be a directory or a link to an existing directory. It will be added to the list subject to any exclusions that are in place. The ignoreFile does not apply to this method, only to addDirContents.
- Parameters
path (String representing a path on disk) – Directory path to be added to the list
- Returns
Number of items added to the list
- Raises
ValueError – If path is not a directory or does not exist
ValueError – If the path could not be encoded properly
- addDirContents(path, recursive=True, addSelf=True, linkDepth=0, dereference=False)¶
Adds the contents of a directory to the list.
The path must exist and must be a directory or a link to a directory. The contents of the directory (as well as the directory path itself) will be recursively added to the list, subject to any exclusions that are in place. If you only want the directory and its immediate contents to be added, then pass in recursive=False.
Note: If a directory’s absolute path matches an exclude pattern or path, or if the directory contains the configured ignore file, then the directory and all of its contents will be recursively excluded from the list.
Note: If the passed-in directory happens to be a soft link, it will be recursed. However, the linkDepth parameter controls whether any soft links within the directory will be recursed. The link depth is the maximum depth of the tree at which soft links should be followed. So, a depth of 0 does not follow any soft links, a depth of 1 follows only links within the passed-in directory, a depth of 2 follows the links at the next level down, etc.
Note: Any invalid soft links (i.e. soft links that point to non-existent items) will be silently ignored.
Note: The excludeDirs flag only controls whether any given directory path itself is added to the list once it has been discovered. It does not modify any behavior related to directory recursion.
Note: If you call this method on a link to a directory, that link will never be dereferenced (it may, however, be followed).
- Parameters
path (String representing a path on disk) – Directory path whose contents should be added to the list
recursive (Boolean value) – Indicates whether directory contents should be added recursively
addSelf (Boolean value) – Indicates whether the directory itself should be added to the list
linkDepth (Integer value) – Maximum depth of the tree at which soft links should be followed; zero means not to follow any soft links
dereference (Boolean value) – Indicates whether soft links, if followed, should be dereferenced
- Returns
Number of items recursively added to the list
- Raises
ValueError – If path is not a directory or does not exist
ValueError – If the path could not be encoded properly
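The link-depth rule described in the notes above can be sketched with the standard library. This is an illustration under stated assumptions rather than the library's implementation: the function name collect and its link_depth argument are hypothetical, the depth budget is assumed to shrink by one for each directory level descended, and exclusions and the dereference option are not modelled. It needs a platform that permits os.symlink.

```python
import os
import tempfile


def collect(path, link_depth=0):
    """Recursively list path and its contents, following soft links to
    directories only while link_depth is greater than zero."""
    found = [path]
    for name in sorted(os.listdir(path)):
        entry = os.path.join(path, name)
        if not os.path.exists(entry):
            continue  # broken soft link: silently ignored
        if os.path.isdir(entry):
            if os.path.islink(entry) and link_depth <= 0:
                found.append(entry)  # record the link itself, do not recurse
            else:
                # One level further down, so one less level of link budget.
                found.extend(collect(entry, link_depth - 1))
        else:
            found.append(entry)
    return found


# Build a small tree: root/real/file.txt, plus root/link -> a separate directory.
base = tempfile.mkdtemp()
root = os.path.join(base, "root")
target = os.path.join(base, "target")
os.makedirs(os.path.join(root, "real"))
os.makedirs(target)
open(os.path.join(root, "real", "file.txt"), "w").close()
open(os.path.join(target, "inside.txt"), "w").close()
os.symlink(target, os.path.join(root, "link"))

shallow = collect(root, link_depth=0)  # link listed, contents not visited
deep = collect(root, link_depth=1)     # link followed one level
```

With a depth of 0 the link appears in the result as a single entry; with a depth of 1 the file behind the link appears as well.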
- removeFiles(pattern=None)¶
Removes file entries from the list.
If pattern is not passed in or is None, then all file entries will be removed from the list. Otherwise, only those file entries matching the pattern will be removed. Any entry which does not exist on disk will be ignored (use removeInvalid to purge those entries).
This method might be fairly slow for large lists, since it must check the type of each item in the list. If you know ahead of time that you want to exclude all files, then you will be better off setting excludeFiles to True before adding items to the list.
- Parameters
pattern – Regular expression pattern representing entries to remove
- Returns
Number of entries removed
- Raises
ValueError – If the passed-in pattern is not a valid regular expression
- removeDirs(pattern=None)¶
Removes directory entries from the list.
If pattern is not passed in or is None, then all directory entries will be removed from the list. Otherwise, only those directory entries matching the pattern will be removed. Any entry which does not exist on disk will be ignored (use removeInvalid to purge those entries).
This method might be fairly slow for large lists, since it must check the type of each item in the list. If you know ahead of time that you want to exclude all directories, then you will be better off setting excludeDirs to True before adding items to the list (note that this will not prevent you from recursively adding the contents of directories).
- Parameters
pattern – Regular expression pattern representing entries to remove
- Returns
Number of entries removed
- Raises
ValueError – If the passed-in pattern is not a valid regular expression
- removeLinks(pattern=None)¶
Removes soft link entries from the list.
If pattern is not passed in or is None, then all soft link entries will be removed from the list. Otherwise, only those soft link entries matching the pattern will be removed. Any entry which does not exist on disk will be ignored (use removeInvalid to purge those entries).
This method might be fairly slow for large lists, since it must check the type of each item in the list. If you know ahead of time that you want to exclude all soft links, then you will be better off setting excludeLinks to True before adding items to the list.
- Parameters
pattern – Regular expression pattern representing entries to remove
- Returns
Number of entries removed
- Raises
ValueError – If the passed-in pattern is not a valid regular expression
- removeMatch(pattern)¶
Removes from the list all entries matching a pattern.
This method removes from the list all entries which match the passed-in pattern. Since there is no need to check the type of each entry, it is faster to call this method than to call the removeFiles, removeDirs or removeLinks methods individually. If you know which patterns you will want to remove ahead of time, you may be better off setting excludePatterns or excludeBasenamePatterns before adding items to the list.
Note: Unlike when using the exclude lists, the pattern here is not bounded at the front and the back of the string. You can use any pattern you want.
- Parameters
pattern – Regular expression pattern representing entries to remove
- Returns
Number of entries removed
- Raises
ValueError – If the passed-in pattern is not a valid regular expression
- removeInvalid()¶
Removes from the list all entries that do not exist on disk.
This method removes from the list all entries which do not currently exist on disk in some form. No attention is paid to whether the entries are files or directories.
- Returns
Number of entries removed
- normalize()¶
Normalizes the list, ensuring that each entry is unique.
- verify()¶
Verifies that all entries in the list exist on disk.
- Returns
True if all entries exist, False otherwise
- class CedarBackup3.filesystem.SpanItem(fileList, size, capacity, utilization)¶
Bases:
object
Item returned by BackupFileList.generateSpan.
- class CedarBackup3.filesystem.BackupFileList¶
Bases:
FilesystemList
List of files to be backed up.
A BackupFileList is a FilesystemList containing a list of files to be backed up. It only contains files, not directories (soft links are treated like files). On top of the generic functionality provided by FilesystemList, this class adds functionality to keep a hash (checksum) for each file in the list, and it also provides a method to calculate the total size of the files in the list and a way to export the list into tar form.
- addDir(path)¶
Adds a directory to the list.
Note that this class does not allow directories to be added by themselves (a backup list contains only files). However, since links to directories are technically files, we allow them to be added.
This method is implemented in terms of the superclass method, with one additional validation: the superclass method is only called if the passed-in path is both a directory and a link. All of the superclass’s existing validations and restrictions apply.
- Parameters
path (String representing a path on disk) – Directory path to be added to the list
- Returns
Number of items added to the list
- Raises
ValueError – If path is not a directory or does not exist
ValueError – If the path could not be encoded properly
- totalSize()¶
Returns the total size among all files in the list. Only files are counted. Soft links that point at files are ignored. Entries which do not exist on disk are ignored.
- Returns
Total size, in bytes
- generateSizeMap()¶
Generates a mapping from file to file size in bytes. The mapping does include soft links, which are listed with size zero. Entries which do not exist on disk are ignored.
- Returns
Dictionary mapping file to file size
- generateDigestMap(stripPrefix=None)¶
Generates a mapping from file to file digest.
Currently, the digest is an SHA hash, which should be pretty secure. In the future, this might be a different kind of hash, but we guarantee that the type of the hash will not change unless the library major version number is bumped.
Entries which do not exist on disk are ignored.
Soft links are ignored. We would end up generating a digest for the file that the soft link points at, which doesn’t make any sense.
If stripPrefix is passed in, then that prefix will be stripped from each key when the map is generated. This can be useful in generating two “relative” digest maps to be compared to one another.
- Parameters
stripPrefix (String with any contents) – Common prefix to be stripped from paths
- Returns
Dictionary mapping file to digest value
See also: removeUnchanged
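The shape of a digest map, and the effect of stripping a prefix, can be sketched with hashlib. This is not the library's implementation: the function name digest_map is hypothetical, and SHA-1 is an assumption standing in for the "SHA hash" mentioned above.

```python
import hashlib
import os
import tempfile


def digest_map(paths, strip_prefix=None):
    """Map each regular, non-link file to the hex digest of its contents."""
    result = {}
    for path in paths:
        if os.path.islink(path) or not os.path.isfile(path):
            continue  # soft links and missing entries are skipped
        digest = hashlib.sha1()  # assumption: SHA-1 stands in for "an SHA hash"
        with open(path, "rb") as handle:
            # Read in chunks so large files are not loaded into memory at once.
            for chunk in iter(lambda: handle.read(65536), b""):
                digest.update(chunk)
        key = path
        if strip_prefix and path.startswith(strip_prefix):
            key = path[len(strip_prefix):]
        result[key] = digest.hexdigest()
    return result


workdir = tempfile.mkdtemp()
sample = os.path.join(workdir, "hello.txt")
with open(sample, "wb") as handle:
    handle.write(b"hello")

# The missing entry is ignored; only the existing file gets a digest.
full = digest_map([sample, os.path.join(workdir, "missing.txt")])
# Stripping the directory prefix leaves a key relative to that directory.
relative = digest_map([sample], strip_prefix=workdir)
```

Two maps built with different prefixes stripped can then be compared key by key, which is what makes the "relative" form useful for comparing directory trees.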
- generateFitted(capacity, algorithm='worst_fit')¶
Generates a list of items that fit in the indicated capacity.
Sometimes, callers would like to include every item in a list, but are unable to because not all of the items fit in the space available. This method returns a copy of the list, containing only the items that fit in a given capacity. A copy is returned so that we don’t lose any information if for some reason the fitted list is unsatisfactory.
The fitting is done using the functions in the knapsack module. By default, the worst fit algorithm is used, but you can also choose from first fit, best fit and alternate fit.
- Parameters
capacity (Integer, in bytes) – Maximum capacity among the files in the new list
algorithm (One of "first_fit", "best_fit", "worst_fit", "alternate_fit") – Knapsack (fit) algorithm to use
- Returns
Copy of list with total size no larger than indicated capacity
- Raises
ValueError – If the algorithm is invalid
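To make the idea of fitting concrete, here is a minimal sketch of one of the named strategies, first fit, in its generic textbook form. The function name first_fit and the sample data are hypothetical, and this is not the knapsack module's code, whose behavior may differ in detail.

```python
def first_fit(sizes, capacity):
    """Walk the items in their original order, keeping each one that still fits.

    sizes maps an item name to its size in bytes (as a size map would).
    Returns the chosen names and the total size they occupy.
    """
    chosen = []
    used = 0
    for name, size in sizes.items():
        if used + size <= capacity:
            chosen.append(name)
            used += size
    return chosen, used


# Hypothetical size map, in bytes.
sizes = {"a.log": 400, "b.iso": 700, "c.txt": 100, "d.db": 300}
chosen, used = first_fit(sizes, capacity=900)
```

Here b.iso is skipped because it would overflow the capacity once a.log has been taken, while the smaller items after it still fit.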
- generateSpan(capacity, algorithm='worst_fit')¶
Splits the list of items into sub-lists that fit in a given capacity.
Sometimes, callers need to split a backup file list into a set of smaller lists. For instance, you could use this to “span” the files across a set of discs.
The fitting is done using the functions in the knapsack module. By default, the worst fit algorithm is used, but you can also choose from first fit, best fit and alternate fit.
Note: If any of your items are larger than the capacity, then it won’t be possible to find a solution. In this case, a value error will be raised.
- Parameters
capacity (Integer, in bytes) – Maximum capacity among the files in the new list
algorithm (One of "first_fit", "best_fit", "worst_fit", "alternate_fit") – Knapsack (fit) algorithm to use
- Returns
List of SpanItem objects
- Raises
ValueError – If the algorithm is invalid
ValueError – If it’s not possible to fit some items
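Spanning can be pictured as repeated fitting: fill one group, then fit whatever is left into the next. The sketch below uses a simple first-fit pass per group. The function name span is hypothetical, the tuples only loosely mirror the (fileList, size, capacity, utilization) fields of SpanItem, and the real method delegates to the knapsack module instead.

```python
def span(sizes, capacity):
    """Split items into groups whose total size never exceeds capacity.

    Returns a list of (names, used, capacity, utilization) tuples.
    """
    if capacity <= 0:
        raise ValueError("Capacity must be positive.")
    oversized = [name for name, size in sizes.items() if size > capacity]
    if oversized:
        # No arrangement can place an item that is larger than one group.
        raise ValueError("Items larger than capacity: %s" % ", ".join(oversized))
    remaining = dict(sizes)
    groups = []
    while remaining:
        names, used = [], 0
        for name, size in list(remaining.items()):
            if used + size <= capacity:
                names.append(name)
                used += size
                del remaining[name]
        groups.append((names, used, capacity, used / capacity))
    return groups


# Hypothetical size map, in bytes.
spans = span({"a.log": 400, "b.iso": 700, "c.txt": 100, "d.db": 300}, 900)
```

Because every item is known to fit into an empty group, each pass places at least one item and the loop always terminates.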
- generateTarfile(path, mode='tar', ignore=False, flat=False)¶
Creates a tar file containing the files in the list.
By default, this method will create uncompressed tar files. If you pass in mode 'targz', then it will create gzipped tar files, and if you pass in mode 'tarbz2', then it will create bzipped tar files.
The tar file will be created as a GNU tar archive, which enables extended file name lengths, etc. Since GNU tar is so prevalent, I’ve decided that the extra functionality outweighs the disadvantage of not being “standard”.
If you pass in flat=True, then a “flat” archive will be created, and all of the files will be added to the root of the archive. So, the file /tmp/something/whatever.txt would be added as just whatever.txt.
By default, the whole method call fails if there are problems adding any of the files to the archive, resulting in an exception. Under these circumstances, callers are advised that they might want to call removeInvalid and then attempt to create the tar file a second time, since the most common cause of failures is a missing file (a file that existed when the list was built, but is gone again by the time the tar file is built).
If you want to, you can pass in ignore=True, and the method will ignore errors encountered when adding individual files to the archive (but not errors opening and closing the archive itself).
We’ll always attempt to remove the tarfile from disk if an exception will be thrown.
Note: No validation is done as to whether the entries in the list are files, since only files or soft links should be in an object like this. However, to be safe, everything is explicitly added to the tar archive non-recursively so it’s safe to include soft links to directories.
Note: The Python tarfile module, which is used internally here, is supposed to deal properly with long filenames and links. In my testing, I have found that it appears to be able to add really long filenames to archives, but doesn’t do a good job reading them back out, even out of an archive it created. Fortunately, all Cedar Backup does is add files to archives.
- Parameters
path (String representing a path on disk) – Path of tar file to create on disk
mode (One of either 'tar', 'targz' or 'tarbz2') – Tar creation mode
ignore (Boolean) – Indicates whether to ignore certain errors
flat (Boolean) – Creates “flat” archive by putting all items in root
- Raises
ValueError – If mode is not valid
ValueError – If list is empty
ValueError – If the path could not be encoded properly
TarError – If there is a problem creating the tar file
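The behavior described above (mode selection, GNU format, flat archives, non-recursive adds, tolerated per-file errors and cleanup on failure) can be approximated with the standard tarfile module. This is a sketch under assumptions, not the method's source: the function name write_tar and the OPEN_MODES table are hypothetical, and only OSError is treated as a per-file error.

```python
import os
import tarfile
import tempfile

# Mode names come from the text above; the values are standard tarfile open modes.
OPEN_MODES = {"tar": "w:", "targz": "w:gz", "tarbz2": "w:bz2"}


def write_tar(entries, path, mode="tar", ignore=False, flat=False):
    """Write entries to a GNU-format tar archive, non-recursively."""
    if mode not in OPEN_MODES:
        raise ValueError("Mode [%s] is not valid." % mode)
    if not entries:
        raise ValueError("List is empty.")
    try:
        with tarfile.open(path, OPEN_MODES[mode], format=tarfile.GNU_FORMAT) as tar:
            for entry in entries:
                # A flat archive stores every file under its basename only.
                arcname = os.path.basename(entry) if flat else entry
                try:
                    tar.add(entry, arcname=arcname, recursive=False)
                except OSError:
                    if not ignore:
                        raise
    except Exception:
        if os.path.exists(path):
            os.remove(path)  # do not leave a partial archive behind
        raise


workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "whatever.txt")
with open(source, "w") as handle:
    handle.write("data")
gone = os.path.join(workdir, "gone.txt")  # never created, so adding it fails

archive = os.path.join(workdir, "out.tar.gz")
write_tar([source, gone], archive, mode="targz", ignore=True, flat=True)
with tarfile.open(archive) as tar:
    names = tar.getnames()
```

Passing recursive=False to tar.add is what keeps a soft link to a directory from pulling the directory's contents into the archive.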
- removeUnchanged(digestMap, captureDigest=False)¶
Removes unchanged entries from the list.
This method relies on a digest map as returned from generateDigestMap. For each entry in digestMap, if the entry also exists in the current list and the entry in the current list has the same digest value as in the map, the entry in the current list will be removed.
This method offers a convenient way for callers to filter unneeded entries from a list. The idea is that a caller will capture a digest map from generateDigestMap at some point in time (perhaps the beginning of the week), and will save off that map using pickle or some other method. Then, the caller could use this method sometime in the future to filter out any unchanged files based on the saved-off map.
If captureDigest is passed in as True, then digest information will be captured for the entire list before the removal step occurs, using the same rules as in generateDigestMap. The check will involve a lookup into the complete digest map.
If captureDigest is passed in as False, we will only generate a digest value for files we actually need to check, and we’ll ignore any entry in the list which isn’t a file that currently exists on disk.
The return value varies depending on captureDigest, as well. To preserve backwards compatibility, if captureDigest is False, then we’ll just return a single value representing the number of entries removed. Otherwise, we’ll return a tuple of (entries removed, digest map). The returned digest map will be in exactly the form returned by generateDigestMap.
Note: For performance reasons, this method actually ends up rebuilding the list from scratch. First, we build a temporary dictionary containing all of the items from the original list. Then, we remove items as needed from the dictionary (which is faster than the equivalent operation on a list). Finally, we replace the contents of the current list based on the keys left in the dictionary. This should be transparent to the caller.
- Parameters
digestMap (Map as returned from generateDigestMap) – Dictionary mapping file name to digest value
captureDigest (Boolean) – Indicates that digest information should be captured
- Returns
Results as discussed above (format varies based on arguments)
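The dictionary-based rebuild described in the note can be sketched as follows. The function name remove_unchanged and its arguments are hypothetical, and the current digests are passed in as a ready-made map to keep the example free of disk access, so this is an illustration of the filtering step only.

```python
def remove_unchanged(entries, saved_digests, current_digests):
    """Return (kept entries, number removed) after dropping unchanged files.

    entries is the current list, saved_digests is a map captured earlier,
    and current_digests maps each entry to its digest as of now.
    """
    # Temporary dictionary keyed by entry; deletion here is cheaper than
    # repeatedly calling list.remove(), and insertion order is preserved.
    table = dict.fromkeys(entries)
    removed = 0
    for entry, old_digest in saved_digests.items():
        if entry in table and current_digests.get(entry) == old_digest:
            del table[entry]
            removed += 1
    return list(table), removed


# Hypothetical paths and digest values.
entries = ["/a", "/b", "/c"]
saved = {"/a": "111", "/b": "222", "/z": "999"}
current = {"/a": "111", "/b": "333", "/c": "444"}
kept, removed = remove_unchanged(entries, saved, current)
```

In this example /a is dropped as unchanged, /b stays because its digest differs, and /c stays because it was never in the saved map.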
- class CedarBackup3.filesystem.PurgeItemList¶
Bases:
FilesystemList
List of files and directories to be purged.
A PurgeItemList is a FilesystemList containing a list of files and directories to be purged. On top of the generic functionality provided by FilesystemList, this class adds functionality to remove items that are too young to be purged, and to actually remove each item in the list from the filesystem.
The other main difference is that when you add a directory’s contents to a purge item list, the directory itself is not added to the list. This way, if someone asks to purge within /opt/backup/collect, that directory doesn’t get removed once all of the files within it are gone.
- addDirContents(path, recursive=True, addSelf=True, linkDepth=0, dereference=False)¶
Adds the contents of a directory to the list.
The path must exist and must be a directory or a link to a directory. The contents of the directory (but not the directory path itself) will be recursively added to the list, subject to any exclusions that are in place. If you only want the directory’s immediate contents to be added, then pass in recursive=False.
Note: If a directory’s absolute path matches an exclude pattern or path, or if the directory contains the configured ignore file, then the directory and all of its contents will be recursively excluded from the list.
Note: If the passed-in directory happens to be a soft link, it will be recursed. However, the linkDepth parameter controls whether any soft links within the directory will be recursed. The link depth is the maximum depth of the tree at which soft links should be followed. So, a depth of 0 does not follow any soft links, a depth of 1 follows only links within the passed-in directory, a depth of 2 follows the links at the next level down, etc.
Note: Any invalid soft links (i.e. soft links that point to non-existent items) will be silently ignored.
Note: The excludeLinks flag only controls whether any given soft link path itself is added to the list once it has been discovered. It does not modify any behavior related to directory recursion.
Note: The excludeDirs flag only controls whether any given directory path itself is added to the list once it has been discovered. It does not modify any behavior related to directory recursion.
Note: If you call this method on a link to a directory, that link will never be dereferenced (it may, however, be followed).
- Parameters
path (String representing a path on disk) – Directory path whose contents should be added to the list
recursive (Boolean value) – Indicates whether directory contents should be added recursively
addSelf – Ignored in this subclass
linkDepth (Integer value, where zero means not to follow any soft links) – Depth of soft links that should be followed
dereference (Boolean value) – Indicates whether soft links, if followed, should be dereferenced
- Returns
Number of items recursively added to the list
- Raises
ValueError – If path is not a directory or does not exist
ValueError – If the path could not be encoded properly
- removeYoungFiles(daysOld)¶
Removes from the list files younger than a certain age (in days).
Any file whose “age” in days is less than (<) the value of the daysOld parameter will be removed from the list so that it will not be purged later when purgeItems is called. Directories and soft links will be ignored.
The “age” of a file is the amount of time since the file was last used, per the most recent of the file’s st_atime and st_mtime values.
Note: Some people find the “sense” of this method confusing or “backwards”. Keep in mind that this method is used to remove items from the list, not from the filesystem! It removes from the list those items that you would not want to purge because they are too young. As an example, passing in daysOld of zero (0) would remove from the list no files, which would result in purging all of the files later. I would be happy to make a synonym of this method with an easier-to-understand “sense”, if someone can suggest one.
- Parameters
daysOld (Integer value >= 0) – Minimum age of files that are to be kept in the list
- Returns
Number of entries removed
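The age rule can be sketched with os.stat. The names age_in_days and drop_young are hypothetical, and rounding the age down to whole days is an assumption made for this illustration; the text above only says that the age is measured in days from the more recent of st_atime and st_mtime.

```python
import math
import os
import tempfile
import time

SECONDS_PER_DAY = 86400


def age_in_days(path, now=None):
    """Whole days since the file was last accessed or modified."""
    now = time.time() if now is None else now
    info = os.stat(path)
    last_used = max(info.st_atime, info.st_mtime)
    return math.floor((now - last_used) / SECONDS_PER_DAY)


def drop_young(entries, days_old, now=None):
    """Return entries minus the plain files younger than days_old days."""
    kept = []
    for entry in entries:
        is_plain_file = os.path.isfile(entry) and not os.path.islink(entry)
        if is_plain_file and age_in_days(entry, now) < days_old:
            continue  # too young: leave it off the list (and untouched on disk)
        kept.append(entry)  # directories and soft links pass through unchanged
    return kept


now = time.time()
workdir = tempfile.mkdtemp()
old_file = os.path.join(workdir, "old.log")
open(old_file, "w").close()
# Pretend the file was last used a little over three days ago.
stamp = now - 3 * SECONDS_PER_DAY - 100
os.utime(old_file, (stamp, stamp))

kept_at_3 = drop_young([old_file, workdir], 3, now)  # age 3 is not < 3
kept_at_4 = drop_young([old_file, workdir], 4, now)  # age 3 is < 4
```

A file that is exactly daysOld days old stays in the list, and will therefore be purged later, because the comparison is strictly less-than.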
- purgeItems()¶
Purges all items in the list.
Every item in the list will be purged. Directories in the list will not be purged recursively, and hence will only be removed if they are empty. Errors will be ignored.
To facilitate easy removal of directories that will end up being empty, the delete process happens in two passes: files first (including soft links), then directories.
- Returns
Tuple containing count of (files, dirs) removed
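The two-pass order matters because a directory can only be removed once it is empty. A minimal sketch of the idea follows, assuming that "errors will be ignored" means swallowing OSError; the function name purge is hypothetical and this is not the method's source.

```python
import os
import tempfile


def purge(entries):
    """Remove files and links first, then directories that are now empty."""
    files = dirs = 0
    # Pass 1: files and soft links (a link to a directory counts as a file).
    for entry in entries:
        if os.path.islink(entry) or os.path.isfile(entry):
            try:
                os.remove(entry)
                files += 1
            except OSError:
                pass  # errors are ignored
    # Pass 2: directories, non-recursively; non-empty ones fail and are skipped.
    for entry in entries:
        if os.path.isdir(entry) and not os.path.islink(entry):
            try:
                os.rmdir(entry)
                dirs += 1
            except OSError:
                pass
    return files, dirs


workdir = tempfile.mkdtemp()
empty_later = os.path.join(workdir, "sub")
still_full = os.path.join(workdir, "full")
os.makedirs(empty_later)
os.makedirs(still_full)
listed_file = os.path.join(empty_later, "x.txt")
open(listed_file, "w").close()
# This file is not in the list, so its directory stays non-empty.
open(os.path.join(still_full, "keep.txt"), "w").close()

counts = purge([empty_later, listed_file, still_full])
```

Even though the directory appears before its file in the list, it is removed successfully because all files are handled before any directory.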
- CedarBackup3.filesystem.normalizeFile(path)¶
Normalizes a file name.
On Windows in particular, we often end up with mixed slashes, where parts of a path have forward slash and parts have backward slash. This makes it difficult to construct exclusions in configuration, because you never know what part of a path will have what kind of slash. I’ve decided to standardize on forward slashes.
- Parameters
path (String representing a path on disk) – Path to be normalized
- Returns
Normalized path, which should be equivalent to the original
- CedarBackup3.filesystem.normalizeDir(path)¶
Normalizes a directory name.
For our purposes, a directory name is normalized by removing the trailing path separator, if any. This is important because we want directories to appear within lists in a consistent way, although from the user’s perspective passing in
/path/to/dir/
and/path/to/dir
are equivalent.We also convert slashes. On Windows in particular, we often end up with mixed slashes, where parts of a path have forward slash and parts have backward slash. This makes it difficult to construct exclusions in configuration, because you never know what part of a path will have what kind of slash. I’ve decided to standardize on forward slashes.
- Parameters
path (String representing a path on disk) – Path to be normalized
- Returns
Normalized path, which should be equivalent to the original
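The two normalizations described above can be sketched in a few lines. The names normalize_file and normalize_dir are hypothetical stand-ins, and leaving a bare root path untouched is an assumption of this sketch rather than documented behavior.

```python
def normalize_file(path):
    """Standardize on forward slashes."""
    return path.replace("\\", "/")


def normalize_dir(path):
    """Standardize slashes and drop a single trailing separator, if any."""
    path = normalize_file(path)
    # Assumption: a bare root ("/") is left alone so it stays a valid path.
    if len(path) > 1 and path.endswith("/"):
        return path[:-1]
    return path


mixed = normalize_dir("C:\\backup/collect\\")
trailing = normalize_dir("/path/to/dir/")
plain = normalize_dir("/path/to/dir")
```

After normalization, /path/to/dir/ and /path/to/dir produce the same string, so an exclusion written once will match either spelling.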
- CedarBackup3.filesystem.compareContents(path1, path2, verbose=False)¶
Compares the contents of two directories to see if they are equivalent.
The two directories are recursively compared. First, we check whether they contain exactly the same set of files. Then, we check to see that every given file has exactly the same contents in both directories.
This is all relatively simple to implement through the magic of BackupFileList.generateDigestMap, which knows how to strip a path prefix off the front of each entry in the mapping it generates. This makes our comparison as simple as creating a list for each path, then generating a digest map for each path and comparing the two.
If no exception is thrown, the two directories are considered identical.
If the verbose flag is True, then an alternate (but slower) method is used so that any thrown exception can indicate exactly which file caused the comparison to fail. The thrown ValueError exception distinguishes between the directories containing different files, and containing the same files with differing content.
Note: Symlinks are not followed for the purposes of this comparison.
- Parameters
path1 (String representing a path on disk) – First path to compare
path2 (String representing a path on disk) – Second path to compare
verbose (Boolean) – Indicates whether a verbose response should be given
- Raises
ValueError – If a directory doesn’t exist or can’t be read
ValueError – If the two directories are not equivalent
IOError – If there is an unusual problem reading the directories
- CedarBackup3.filesystem.compareDigestMaps(digest1, digest2, verbose=False)¶
Compares two digest maps and throws an exception if they differ.
- Parameters
digest1 (Digest as returned from BackupFileList.generateDigestMap()) – First digest to compare
digest2 (Digest as returned from BackupFileList.generateDigestMap()) – Second digest to compare
verbose (Boolean) – Indicates whether a verbose response should be given
- Raises
ValueError – If the two digest maps are not equivalent
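A comparison of two digest maps, including a verbose mode that reports which file is at fault, might look like the sketch below. The function name compare_maps and the wording of the error messages are hypothetical; only the use of ValueError and the two kinds of difference come from the text above.

```python
def compare_maps(digest1, digest2, verbose=False):
    """Raise ValueError if two digest maps differ; return None otherwise."""
    if not verbose:
        # Fast path: a single dictionary comparison, with no detail on failure.
        if digest1 != digest2:
            raise ValueError("Consistency check failed.")
        return
    # Slow path: first look for files present on one side only...
    only_first = sorted(set(digest1) - set(digest2))
    only_second = sorted(set(digest2) - set(digest1))
    if only_first or only_second:
        raise ValueError("Different files: %s" % ", ".join(only_first + only_second))
    # ...then look for files whose contents differ.
    for key in sorted(digest1):
        if digest1[key] != digest2[key]:
            raise ValueError("Different content: %s" % key)


def failure_message(first, second):
    """Helper for the demonstration: return the verbose error text, or None."""
    try:
        compare_maps(first, second, verbose=True)
    except ValueError as error:
        return str(error)
    return None


# Hypothetical relative paths and digest values.
same = failure_message({"/a": "1"}, {"/a": "1"})
missing = failure_message({"/a": "1", "/b": "2"}, {"/a": "1"})
changed = failure_message({"/a": "1"}, {"/a": "9"})
```

The non-verbose path stays cheap for the common case where the maps match, and the extra work is only done when the caller asks for a precise diagnosis.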