VeriTAR – Verify checksums of files within a TAR archive

In my opinion, the biggest problem of the tar format ('ustar') is that it does not store checksums of the files it contains. To verify the contents of a tar archive, you therefore either need to keep the original data on the hard drive and compare the archive against it using tar's -d switch, or keep the MD5 sums of the files in a separate document and use an external program to check them against the MD5 sums calculated from the archived files. In this short post I introduce a method of creating a tar archive while recording the MD5 sums of its files at the same time, and a utility, veritar, which compares those MD5 sums with the checksums of the archive's contents in place, without the need to extract them.

Creation of the TAR archive and the MD5 sums file

In the following example it is assumed that the files to back up reside in the myfiles/ subdirectory, that the tar archive will be named mybackup.tar, and that the file containing the md5 sums will be named mybackup.md5.

$ tar -cvpf mybackup.tar myfiles/ \
    | xargs -I '{}' sh -c "test -f '{}' && md5sum '{}'" \
    | tee mybackup.md5

Some notes:

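  • You can use any tar switch for the creation of the archive except -C. If you need to change to another directory, do it using cd beforehand; otherwise the paths that tar prints will not resolve from the current directory, the file test will fail, and no md5 sums will be recorded.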
  • Make sure that you include the -v (--verbose) switch when invoking tar, as the paths need to be printed to stdout in order to be processed by xargs.
  • In the xargs statement, the -I '{}' part indicates that the '{}' string will be replaced by the path that is passed to xargs through the pipe.
  • The sh -c "test -f '{}' && md5sum '{}'" does two things: tests if the path ('{}') is a file and calculates the md5 sum for it.
  • In the last part, tee is used in order to print the md5sum to the stdout and also to the mybackup.md5 file.

When this operation completes, you will have two files: mybackup.tar and mybackup.md5.
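
For readers who prefer to avoid shell quoting altogether, the same md5 sums file can be produced with a few lines of Python. The following is only a sketch and not part of veritar: the function names md5_of_file, md5_lines and write_md5_file are made up for illustration, symbolic links are skipped on the assumption that only regular files matter, and md5sum's escaping of unusual file names (backslashes, newlines) is not reproduced.

```python
import hashlib
import os


def md5_of_file(path, block_size=65536):
    """Return the hex MD5 digest of the file at path, read block by block."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def md5_lines(top):
    """Yield 'checksum  path' lines for the regular files found under top."""
    for dirpath, dirnames, filenames in os.walk(top):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Assumption: symbolic links are skipped, only regular files count.
            if os.path.isfile(path) and not os.path.islink(path):
                yield "%s  %s" % (md5_of_file(path), path)


def write_md5_file(top, md5_path):
    """Write one line per regular file under top to md5_path."""
    with open(md5_path, "w") as out:
        for line in md5_lines(top):
            out.write(line + "\n")
```

Running write_md5_file("myfiles", "mybackup.md5") from the directory that contains myfiles/ should record the same relative paths that tar stores for that directory, since both start from the same location.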

Special thanks to:

* Anvil for suggesting the bash -c "...test goes here..." approach.
* Giorgos Keramidas for the improvement he suggested, so that the md5 sum calculation is not limited to regular files only:

sh -c "test -d '{}' || md5sum '{}'"

VeriTAR verifies the md5 sums of regular files only, so whichever test you use when creating the TAR archive, the result is fine.

VeriTAR – Tar archive verification

VeriTAR [Veri(fy)TAR] is a command-line utility that verifies the md5 sums of files within a tar archive. Due to the limitations of the tar ('ustar') format, the md5 sums are retrieved from a separate file and are checked against the md5 sums of the files within the tar archive. The process takes place without actually extracting the files.

It works with corrupted tar archives. The program carries on to the next file within the archive, skipping the damaged parts. At the moment, this relies on Python’s tarfile module internal functions.

VeriTAR is written in Python.

It also works with compressed TAR archives (gzip or bz2).
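
To illustrate the idea of checking members in place, here is a minimal sketch built on Python's standard tarfile and hashlib modules. It is not veritar's actual code: the function names parse_md5_file, member_md5 and verify are assumptions made for this example, it expects md5sum's default "checksum, two spaces, path" line format, and it makes no attempt to recover from damaged archives.

```python
import hashlib
import tarfile


def parse_md5_file(md5_path):
    """Map each recorded path to its checksum, reading md5sum-style lines."""
    expected = {}
    with open(md5_path, "r") as f:
        for line in f:
            line = line.rstrip("\n")
            # Split on the first two-space separator only, so that file
            # names containing consecutive blanks stay intact.
            checksum, sep, name = line.partition("  ")
            if sep:
                expected[name] = checksum
    return expected


def member_md5(archive, member, block_size=65536):
    """Hash one archive member by streaming it, without writing it to disk."""
    digest = hashlib.md5()
    f = archive.extractfile(member)
    try:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    finally:
        f.close()
    return digest.hexdigest()


def verify(tar_path, md5_path):
    """Return (ok, failed, missing) lists of regular-file member names."""
    expected = parse_md5_file(md5_path)
    ok, failed, missing = [], [], []
    # Mode "r:*" lets tarfile detect gzip or bz2 compression by itself.
    with tarfile.open(tar_path, "r:*") as archive:
        for member in archive:
            if not member.isreg():
                continue  # directories, links etc. carry no data to hash
            if member.name not in expected:
                missing.append(member.name)
            elif member_md5(archive, member) == expected[member.name]:
                ok.append(member.name)
            else:
                failed.append(member.name)
    return ok, failed, missing
```

Calling verify("mybackup.tar", "mybackup.md5") returns three lists of member names: those whose checksum matched, those whose checksum differed, and those with no recorded checksum.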

Provided that you have used the method above (or any other method) in order to create a file with the md5 sums together with the tar archive, you can easily verify the contents of the archive with veritar.

$ veritar mybackup.tar mybackup.md5

Please note that veritar’s output and command-line switches need some work, but for now it does the job.

Veritar is released under the Apache License version 2.

It is completely unsupported, but you can still get community support at our software forums. This is also the place where you can inform me about any bugs.

Known issues

  1. Multi-volume tar archives are not supported at the moment
  2. Tar archives in which the metadata of the first archived file has been corrupted cannot be processed due to a limitation in the tarfile Python module at the time of writing
  3. Although checksums from any algorithm (md5, sha1, crc32) could in principle be used, the current alpha version is not very flexible.
  4. It may crash on damaged archives on older Python versions.
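
The third issue is mostly a matter of parameterizing the hash function. As a rough sketch (not taken from veritar; the function name stream_checksum and its parameters are hypothetical), Python's hashlib.new accepts an algorithm name, while CRC32 has to come from the zlib module:

```python
import hashlib
import zlib


def stream_checksum(fileobj, algorithm="md5", block_size=65536):
    """Return the hex checksum of a binary file object for the named algorithm."""
    if algorithm == "crc32":
        # CRC32 is provided by zlib rather than hashlib, so handle it here.
        crc = 0
        for block in iter(lambda: fileobj.read(block_size), b""):
            crc = zlib.crc32(block, crc)
        return "%08x" % (crc & 0xFFFFFFFF)
    digest = hashlib.new(algorithm)  # for example "md5", "sha1" or "sha256"
    for block in iter(lambda: fileobj.read(block_size), b""):
        digest.update(block)
    return digest.hexdigest()
```

Any object with a read method that returns bytes will do here, including the file objects that tarfile returns for archive members.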

VeriTAR – Verify checksums of files within a TAR archive by George Notaras is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright © 2007 - Some Rights Reserved

8 responses on “VeriTAR – Verify checksums of files within a TAR archive”

  1. Jorrit

    The command has problems with filenames with single quotes in them. I use :

    tar -cvpf mybackup.tar myfiles/ \
    | tr '\n' '\0' \
    | xargs --null -I '{}' sh -c "test -f '{}' && md5sum '{}'" \
    | tee mybackup.md5

  2. Alexandru

    Hello George,

    Thank you for your work on veritar. I have found a case in which it is not working – file names with two consecutive blanks – and I have a simple patch to correct it, see below.
    Alexandru

    diff VeriTAR-orig.py VeriTAR.py
    85c85,87
    md5index=line.find("  ")
    > csum=line[0:md5index]
    > name=line[md5index:]

  3. Alexandru

    sorry, the diff was trashed by html; i am trying again surrounded with code tags
    Alexandru

    diff VeriTAR-orig.py VeriTAR.py
    85c85,87
                           md5index=line.find("  ")
    >                       csum=line[0:md5index]
    >                       name=line[md5index:]
    
  4. Alexandru

    No luck again :( , anyway I have replaced line 85 in VeriTAR.py (csum, name = line.split("  ") ) with the next 3 lines
    md5index=line.find("  ")
    csum=line[0:md5index]
    name=line[md5index:]

    Alexandru

  5. Monica

    Hello,

    I have been trying to use the veritar.py in a bash script and it seems to work, but I always get the following messages in the error file:

    veritar-0.3.0/VeriTAR.py:31: DeprecationWarning: the md5 module is deprecated; use hashlib instead
      import md5	# for compatibility with older Python versions
    Traceback (most recent call last):
      File "veritar-0.3.0/veritar", line 25, in <module>
        VeriTAR.main()
      File "veritar-0.3.0/VeriTAR.py", line 294, in main
        x.verify()
      File "veritar-0.3.0/VeriTAR.py", line 239, in verify
        checksum = self.__member_md5sum(member)
      File "veritar-0.3.0/VeriTAR.py", line 210, in __member_md5sum
        checksum = get_member_md5sum(f)
      File "veritar-0.3.0/VeriTAR.py", line 72, in get_member_md5sum
        data = f.read(READ_BLOCK)
      File "/usr/lib64/python2.6/tarfile.py", line 816, in read
        buf += self.fileobj.read(size - len(buf))
      File "/usr/lib64/python2.6/tarfile.py", line 734, in read
        return self.readnormal(size)
      File "/usr/lib64/python2.6/tarfile.py", line 743, in readnormal
        return self.fileobj.read(size)
      File "/usr/lib64/python2.6/tarfile.py", line 567, in read
        buf = self._read(size)
      File "/usr/lib64/python2.6/tarfile.py", line 584, in _read
        buf = self.cmp.decompress(buf)
    EOFError: end of stream was already found
    *****************************************************************
    

    Could you please advise me what I should do to fix this?

    Thank you.

    Best,
    Monica

    1. George Notaras (post author)

      Hello Monica,

      Thanks for your feedback. I’m sorry for the late reply, but your comment for some reason had been caught by the spam filter.

      As far as I can remember, I had developed this script using python 2.4. Maybe some things have changed since then in the tarfile module of the Standard Library. Time permitting I’ll try to take a look at it again. BTW, there is a 0.4.0 (refactored) release over at Github: https://github.com/gnotaras/veritar/releases I have no idea if it’s going to work though..

      The point is that I’ve stopped using or developing this script, so I highly recommend using an alternative utility that is actively maintained to get this job done.

      Best Regards,
      George