Aalok Trivedi · Feb 12, 2023
Python is a powerful, versatile programming language with countless uses for task automation. In this scenario, your company wants to perform an audit on certain machines, and our task is to build a script that pulls information on every file within a given directory.
Goals
- Create a function that gets a list of all files in a given directory.
- Extract the file name, path, size, and creation date of every file, and put this info into a list.
- Output the list into a readable JSON format.
Prerequisites
- Basic knowledge of Python fundamentals (functions, lists, dictionaries, for-loops)
- An IDE, such as VS Code
First, let’s import the necessary Python libraries to build this script:
- os: Allows access to various operating system methods.
- datetime: Allows us to convert raw time outputs into a readable format.
- json: Allows us to convert data into JSON format.
import os
import datetime
import json
The os module gives us time in raw seconds since the epoch, which isn’t very helpful for us lowly humans, so let’s first create a helper function that converts a creation date into a readable format. The function takes a timestamp parameter, which we’ll use later on to pass in the creation date (in seconds).
def convertDate(timestamp):
    d = datetime.datetime.utcfromtimestamp(timestamp)
    formattedDate = d.strftime('%b %d, %Y')
    return formattedDate
The datetime.datetime.utcfromtimestamp() method converts the seconds into a UTC datetime. For example, datetime.datetime.utcfromtimestamp(10000) gives us an output of 1970-01-01 02:46:40. This is a totally acceptable output, but if we want a more readable format, we can use the .strftime() method to produce a string in 'Mon DD, YYYY' format.
Here is a full list of format codes you can use with this method
Let’s test the function by printing the output for 10000 seconds.
print(convertDate(10000)) # Jan 01, 1970
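For reference, a few other common strftime format codes (a quick sketch, using the same 10000-second timestamp):

```python
import datetime

d = datetime.datetime.utcfromtimestamp(10000)
print(d.strftime('%Y-%m-%d'))        # 1970-01-01
print(d.strftime('%H:%M:%S'))        # 02:46:40
print(d.strftime('%A, %B %d, %Y'))   # Thursday, January 01, 1970
```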
Great! Now that our convertDate() function is working, we can move on to the meat and potatoes of the script: extracting file information from a given path. For more info on the datetime module, check out the official documentation.
To get specific file and directory information, we will use the os library and the various methods it provides.
Here’s a breakdown of what we need this function to do:
- Accept a custom path or default it to the current working directory.
- Take the path and find every file within the directory and its subdirectories.
- Extract the name, path, size, and creation date of each file
- Take the file information and put it into a dictionary and list.
- Take that list of dictionaries and convert it into JSON format.
Let’s start by defining the function getDirDetails(), which takes a path parameter that defaults to the current working directory. To get the current working directory, we can use the os.getcwd() method. Set the path parameter equal to the cwd as the default and test the function.
def getDirDetails(path=os.getcwd()):
    return path
print(getDirDetails())
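One caveat worth knowing: a default like path=os.getcwd() is evaluated once, when the function is defined, not each time it’s called, so if the script ever changes directories with os.chdir(), the default would be stale. It works fine for our script, but a common safer idiom (a sketch, not part of the original tutorial) resolves the default at call time:

```python
import os

def getDirDetails(path=None):
    # Resolve the default when the function is called, not when it's defined,
    # so it always reflects the *current* working directory.
    if path is None:
        path = os.getcwd()
    return path
```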
The first thing we need to do is create some variables that will determine helpful information about the path that will be passed in.
def getDirDetails(path=os.getcwd()):
fileList = []
pathExists = os.path.exists(path)
isFile = os.path.isfile(path)
isDir = os.path.isdir(path)
- fileList = []: An empty list that will eventually store our file info.
- pathExists = os.path.exists(path): Checks whether the provided path exists. Returns True or False.
- isFile = os.path.isfile(path): Checks whether the provided path is a file. Returns True or False.
- isDir = os.path.isdir(path): Checks whether the provided path is a directory. Returns True or False.
The last three will help us with error handling and validation.
Error handling
We only want the function to execute if the path provided exists AND is a directory. If the path does not exist or leads to a file → provide an error statement. We can use multiple if-else statements to check for each case.
def getDirDetails(path=os.getcwd()):
    fileList = []
    pathExists = os.path.exists(path)
    isFile = os.path.isfile(path)
    isDir = os.path.isdir(path)

    if pathExists and isDir:
        #do stuff
        print(f"'{path}' is a directory.")
    elif pathExists and isFile:
        print(f"Error: The path '{path}' must be a directory.")
    elif not pathExists:
        print(f"Error: The path '{path}' does not exist.")
#test it out
print(getDirDetails()) #cwd: should pass
print(getDirDetails("/Users/aaloktrivedi/package-lock.json")) #file: should fail
print(getDirDetails("/Users/sdsvdvd")) #invalid path: should fail
Fantastic! The error handling works as expected.
NOTE: I know Python has built-in error handling (raising exceptions and try/except blocks), so this might not be the best way to tackle validation, but for our purposes, it will work just fine.
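For completeness, here is one hedged sketch of what that exception-based approach could look like, raising Python’s built-in exceptions instead of printing error strings (the validatePath helper is hypothetical, not part of the original script):

```python
import os

def validatePath(path):
    # Raise built-in exceptions rather than printing messages,
    # so callers can decide how to handle each failure.
    if not os.path.exists(path):
        raise FileNotFoundError(f"The path '{path}' does not exist.")
    if os.path.isfile(path):
        raise NotADirectoryError(f"The path '{path}' must be a directory.")
    return path

# callers can then handle the failure however they like
try:
    validatePath("/definitely/not/a/real/path")
except FileNotFoundError as e:
    print(f"Error: {e}")
```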
Iterate through the directory
Now we’re ready to iterate through our directory to get all the files. One way to do this is with the os.walk() method, which walks the entire directory tree and yields a tuple of (dirpath, dirnames, filenames) for each directory it visits. We can use a for-loop to unpack each tuple into the root path (root), the directories (dirs), and the files (files). We then use a nested for-loop over files to reach each individual file.
def getDirDetails(path=os.getcwd()):
    fileList = []
    pathExists = os.path.exists(path)
    isFile = os.path.isfile(path)
    isDir = os.path.isdir(path)

    if pathExists and isDir:
        for root, dirs, files in os.walk(path):
            for file in files:
                print(file)
    ...

#outside of the function
getDirDetails("/Users/aaloktrivedi/LUIT/Projects/LUIT_python")
Perfect! We now have access to every file, even inside subdirectories.
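As an aside, the standard-library pathlib module offers an alternative way to do the same recursive traversal (this is a different technique from the os.walk() approach the article uses; the listFiles helper name is just for illustration):

```python
from pathlib import Path

def listFiles(path):
    # Path.rglob('*') recurses through every subdirectory;
    # is_file() filters out the directories themselves.
    return [p for p in Path(path).rglob('*') if p.is_file()]

for p in listFiles('.'):
    print(p.name)
```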
Let’s get more information about these files, such as the path, size, and creation date, and store it in variables for later use. For the path, we’ll use the os.path.join() method. We have access to the root path and the file name separately, and os.path.join() puts them back together into a full path.
filePath = os.path.join(root, file)
For the file size, we can use the os.path.getsize() method, which returns the size in bytes. Again, we need to pass in the whole path, so we can use the filePath variable we just created.
#I divided the bytes by 1024 to convert it into KB (optional).
#remember you need to pass in the whole file path, not just the file name.
fileSize = round(os.path.getsize(filePath) / 1024, 1)
Remember that convertDate() helper function we created earlier? Time to use it to convert our file creation date. We can use the os.path.getctime() method to get the time in seconds and pass that into our helper function.
fileCreationDate = convertDate(os.path.getctime(filePath))
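A small caveat: os.path.getctime() returns the creation time on Windows, but on most Unix-like systems it returns the time of the last metadata change; os.path.getmtime() (last modification) is often the more portable choice. A quick sketch comparing the two (the printed dates will vary by machine):

```python
import os
import datetime

def convertDate(timestamp):
    # same helper as defined earlier in the article
    d = datetime.datetime.utcfromtimestamp(timestamp)
    return d.strftime('%b %d, %Y')

somePath = __file__  # any existing file works here
print(convertDate(os.path.getctime(somePath)))  # creation (Windows) / metadata change (Unix)
print(convertDate(os.path.getmtime(somePath)))  # last modification time
```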
So far, our function code should look like this:
def getDirDetails(path=os.getcwd()):
    fileList = []
    pathExists = os.path.exists(path)
    isFile = os.path.isfile(path)
    isDir = os.path.isdir(path)

    if pathExists and isDir:
        for root, dirs, files in os.walk(path):
            for file in files:
                filePath = os.path.join(root, file)
                fileSize = round(os.path.getsize(filePath) / 1024, 1)
                fileCreationDate = convertDate(os.path.getctime(filePath))
    elif pathExists and isFile:
        print(f"Error: The path '{path}' must be a directory.")
    elif not pathExists:
        print(f"Error: The path '{path}' does not exist.")
Store data and add to list
We have our file information, so now it’s time to store this data in a dictionary. We can then add the dictionary to the empty fileList we created.
#create a dict for each file
fileDict = {
    'file_name': file,
    'path': filePath,
    'size_kb': fileSize,
    'date_created': fileCreationDate
}

#append the dict to fileList
fileList.append(fileDict)

print(fileList)
It works!… but this isn’t very readable. Let’s convert it to JSON format to make the data much more human-friendly. We can do this with the json.dumps() method (make sure this is outside of the for-loops).
pathFilesJSON = json.dumps(fileList, indent=4)
Finally, we want to return the JSON data (outside of the for-loops), so the function outputs the JSON whenever it’s called.
return pathFilesJSON
#outside of the function
print(getDirDetails("YOUR_PATH"))
Much better!
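As an optional extension (not part of the original script), json.dump() writes the report straight to a file instead of returning a string, which is handy if the audit results need to be saved:

```python
import json

# sample data in the same shape our script produces
fileList = [
    {'file_name': 'example.txt', 'path': '/tmp/example.txt',
     'size_kb': 1.2, 'date_created': 'Jan 01, 1970'}
]

# json.dumps returns a string; json.dump writes directly to a file object
with open('file_report.json', 'w') as f:
    json.dump(fileList, f, indent=4)
```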
We just successfully created a Python script that delivers data on every file within a directory!
Here is the full code:
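Assembled from the snippets above, the complete script looks like this:

```python
import os
import datetime
import json


def convertDate(timestamp):
    # convert seconds since the epoch into a readable 'Mon DD, YYYY' string
    d = datetime.datetime.utcfromtimestamp(timestamp)
    formattedDate = d.strftime('%b %d, %Y')
    return formattedDate


def getDirDetails(path=os.getcwd()):
    fileList = []
    pathExists = os.path.exists(path)
    isFile = os.path.isfile(path)
    isDir = os.path.isdir(path)

    if pathExists and isDir:
        for root, dirs, files in os.walk(path):
            for file in files:
                filePath = os.path.join(root, file)
                fileSize = round(os.path.getsize(filePath) / 1024, 1)
                fileCreationDate = convertDate(os.path.getctime(filePath))

                # create a dict for each file and append it to fileList
                fileDict = {
                    'file_name': file,
                    'path': filePath,
                    'size_kb': fileSize,
                    'date_created': fileCreationDate
                }
                fileList.append(fileDict)

        # convert the list of dicts to readable JSON (outside the for-loops)
        pathFilesJSON = json.dumps(fileList, indent=4)
        return pathFilesJSON
    elif pathExists and isFile:
        print(f"Error: The path '{path}' must be a directory.")
    elif not pathExists:
        print(f"Error: The path '{path}' does not exist.")


# example usage (replace with your own path):
# print(getDirDetails("YOUR_PATH"))
```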
Thank you for following me on my cloud computing journey. I hope this article was helpful and informative. Please give me a follow as I continue my journey, and I will share more articles like this!