Table of Contents
CSV (Comma Separated Values) is a file format used to store tabular data, such as spreadsheets or databases, in a simple text format. CSV files consist of lines, where each line corresponds to a row in the table, and each row is divided into cells by a delimiter, typically a comma.
CSV files can be opened and edited using spreadsheet software like Microsoft Excel or Google Sheets.
Python is useful for reading and writing CSV files because it provides built-in support for handling CSV data through its csv module. The csv module in Python allows users to easily read, write, and manipulate CSV files using simple and intuitive functions and methods.
Python module helps automate tasks like data extraction, data cleaning, and data analysis.
Additionally, Python’s ability to work with other data formats, such as JSON, XML, and databases, makes it a powerful tool for data processing and manipulation.
Let’s see how we can use the Python csv module to read and write CSV file.
Importing the CSV module
To work with CSV files in Python, we need to import the csv
module, which provides functions for reading and writing CSV files. To import the CSV module, we can use the following code:
import csv
Reading a CSV file in Python
To read data from a CSV file using the CSV module, we first need to create a CSV reader object, which takes a file object and an optional delimiter as arguments.
The delimiter specifies the character used to separate the values in each row, which can be a comma, tab, semicolon, or any other character.
Here’s an example code snippet for reading a CSV file with a header and a comma delimiter:
import csv
with open('data.csv', newline='', encoding='utf-8') as csvfile:
reader = csv.reader(csvfile, delimiter=',', quotechar='"')
header = next(reader)
for row in reader:
print(row)
In this code, we open the file ‘data.csv’ using the open()
function with the newline=''
and encoding='utf-8'
arguments, which ensure that the file is read correctly on different platforms and with different character encodings.
We then create a CSV reader object using the csv.reader()
function, passing in the file object and the delimiter (in this case, a comma) as arguments. We also specify the quotechar='"'
argument to handle quoted values.
Next, we use the next()
function to read the first row of the file, which contains the header names, and store it in the header
variable. We then iterate over the remaining rows using a for
loop, printing each row to the console.
Specifying delimiter and other file parameters:
When reading or writing CSV files in Python, we can specify various parameters to customize how the file is parsed or generated. Some common parameters include:
delimiter
: Specifies the character used to separate the values in each row. By default, the delimiter is a comma, but it can be changed to any other character, such as a tab or a semicolon.quotechar
: Specifies the character used to enclose values that contain the delimiter character. By default, the quote character is a double quote, but it can be changed to any other character, such as a single quote.quoting
: Specifies the quoting mode used to handle values that contain the quote character. The possible values arecsv.QUOTE_ALL
(quote all values),csv.QUOTE_MINIMAL
(quote values containing the delimiter or the quote character),csv.QUOTE_NONNUMERIC
(quote all non-numeric values), andcsv.QUOTE_NONE
(do not quote any values).encoding
: Specifies the character encoding used to read or write the file. By default, the encoding is'utf-8'
, but it can be changed to any other supported encoding.
Here’s an example code snippet for reading a CSV file with a semicolon delimiter and a single quote as the quote character:
import csv
with open('data.csv', newline='', encoding='utf-8') as csvfile:
reader = csv.reader(csvfile, delimiter=';', quotechar="'")
for row in reader:
print(row)
In this code, we pass the delimiter=';'
and quotechar="'"
arguments to the csv.reader()
function to specify the semicolon delimiter and the single quote quote character.
Skipping headers or other rows in the file:
When reading CSV files in Python, we can skip certain rows in the file, such as header rows or rows that contain invalid data. To skip a row, we can simply call the next()
function on the reader object, which reads and discards the next row in the file.
Here’s an example code snippet for reading a CSV file with a header row and skipping it:
import csv
with open('data.csv', newline='', encoding='utf-8') as csvfile:
reader = csv.reader(csvfile, delimiter=',', quotechar='"')
next(reader) # skip the header row
for row in reader:
print(row)
In this code, we call the next(reader)
function to read and discard the first row of the file, which contains the header names. We then iterate over the remaining rows using a for
loop, printing each row to the console.
We can also skip other rows based on a condition, such as rows with missing or invalid data. In this case, we can use a try-except
block to catch and handle any errors that may occur while reading the row.
Here’s an example code snippet for reading a CSV file and skipping rows with missing data:
import csv
with open('data.csv', newline='', encoding='utf-8') as csvfile:
reader = csv.reader(csvfile, delimiter=',', quotechar='"')
for row in reader:
try:
# check if the row has missing data
if any(value.strip() == '' for value in row):
raise ValueError('Row has missing data')
# process the row
print(row)
except ValueError as e:
# handle the error
print(f'Skipping row: {e}')
In this code, we use a try-except
block to catch any ValueError
exceptions that may occur while processing a
Writing to a CSV file in Python
To write data to a CSV file using the CSV module, we first need to create a CSV writer object, which takes a file object and an optional delimiter as arguments. We can then use the writer object’s writerow()
method to write each row of data to the file.
Here’s an example code snippet for writing data in Array of Array format to a CSV file with a header and a comma delimiter:
import csv
data = [
['Name', 'Age', 'Country'],
['Alice', 25, 'USA'],
['Bob', 30, 'Canada'],
['Charlie', 35, 'Australia']
]
with open('output.csv', mode='w', newline='', encoding='utf-8') as csvfile:
writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
for row in data:
writer.writerow(row)
In this code, we create a list of lists called data
, where the first list represents the header row and the remaining lists represent the data rows. We then open a new file called ‘output.csv’ using the open()
function with the mode='w'
, newline=''
, and encoding='utf-8'
arguments, which ensure that the
Writing data incrementally to CSV files:
When writing CSV files in Python, we can write data incrementally to the file, which is useful when we have a large amount of data that cannot fit into memory all at once. To do this, we can create a CSV writer object and write each row of data to the file using the writerow()
method.
Here’s an example code snippet for writing data incrementally to a CSV file with a header row and a comma delimiter:
import csv
header = ['Name', 'Age', 'Country']
with open('output.csv', mode='w', newline='', encoding='utf-8') as csvfile:
writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow(header)
# write data incrementally
for i in range(1000000):
row = ['Person ' + str(i), i % 100, 'Country ' + str(i % 10)]
writer.writerow(row)
In this code, we create a header row as a list called header
, and then open a new file called ‘output.csv’ using the open()
function with the mode='w'
, newline=''
, and encoding='utf-8'
arguments, which ensure that the file is written correctly on different platforms and with different character encodings.
We then create a CSV writer object using the csv.writer()
function, passing in the file object and the delimiter (in this case, a comma) as arguments. We also specify the quotechar='"'
argument to handle quoted values, and the quoting=csv.QUOTE_MINIMAL
argument to quote only values containing the delimiter character.
Next, we write the header row to the file using the writerow()
method of the writer object. Finally, we write the data incrementally using a for
loop, generating each row of data on the fly and writing it to the file using the writerow()
method.
Appending data to an existing CSV file:
When appending data to an existing CSV file in Python, we can use the open()
function with the mode='a'
argument, which opens the file in append mode instead of write mode. We can then create a CSV writer object and write each row of data to the file using the writerow()
method.
Here’s an example code snippet for appending data to an existing CSV file with a header row and a comma delimiter:
import csv
header = ['Name', 'Age', 'Country']
new_data = [
['Alice', 25, 'USA'],
['Bob', 30, 'Canada'],
['Charlie', 35, 'Australia']
]
with open('data.csv', mode='a', newline='', encoding='utf-8') as csvfile:
writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow(header)
# append new data to the file
for row in new_data:
writer.writerow(row)
In this code, we create a header row as a list called header
, and a new list of data called new_data
. We then open an existing file called ‘data.csv’ using the open()
function with the mode='a'
, newline=''
, and encoding='utf-8'
arguments, which open the file in append mode instead of write mode.
We then create a CSV writer object using the csv.writer()
function, passing in the file object and the delimiter (in this case, a comma) as arguments. We also specify the quotechar='"'
argument to handle quoted
Encoding and Decoding CSV using Python
When writing CSV data to a file, Python needs to encode the Unicode string data into a byte string representation that can be written to the file. This is known as encoding. Conversely, when reading CSV data from a file, Python needs to decode the byte string data from the file into Unicode string data that can be used by the program. This is known as decoding.
Python’s csv module provides built-in support for encoding and decoding CSV data. When writing data to a CSV file, the module allows you to specify the encoding to use. When reading data from a CSV file, the module automatically detects the encoding of the file and decodes the data accordingly.
Encoding and decoding CSV files:
When reading or writing CSV files in Python, we need to ensure that the file is encoded and decoded correctly, especially if the file contains non-ASCII characters or uses a non-standard encoding.
We can specify the encoding used by the file by passing the encoding
argument to the open()
function when opening the file. When writing to a file, we can also specify the encoding used by the CSV writer object by passing the encoding
argument to the csv.writer()
function.
Here’s an example code snippet for reading a CSV file encoded in UTF-16:
import csv
with open('data.csv', newline='', encoding='utf-16') as csvfile:
reader = csv.reader(csvfile, delimiter=',', quotechar='"')
for row in reader:
print(row)
In this code, we pass the encoding='utf-16'
argument to the open()
function to specify that the file is encoded in UTF-16.
Here’s an example code snippet for writing a CSV file encoded in ISO-8859-1:
import csv
header = ['Name', 'Age', 'Country']
data = [
['Alice', 25, 'USA'],
['Bob', 30, 'Canada'],
['Charlie', 35, 'Australia']
]
with open('output.csv', mode='w', newline='', encoding='iso-8859-1') as csvfile:
writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL, encoding='iso-8859-1')
writer.writerow(header)
for row in data:
writer.writerow(row)
In this code, we pass the encoding='iso-8859-1'
argument to the open()
function to specify that the file should be written in ISO-8859-1 encoding. We also pass the encoding='iso-8859-1'
argument to the csv.writer()
function to specify the encoding used by the CSV writer object.
Parsing Custom CSV file in Python
When parsing CSV files in Python, we may encounter complex or non-standard data formats that require advanced parsing techniques. Here are some examples of advanced parsing techniques that can be used with the CSV module:
Parsing quoted values with embedded delimiters:
If a value in a CSV file contains the delimiter character (e.g., a comma), the value is typically enclosed in quotes. To parse such values correctly, we need to handle the quotes and the embedded delimiter properly. We can do this by using the csv.reader()
function with the quoting=csv.QUOTE_ALL
argument, which tells the parser to treat all values as quoted values, and then removing the quotes manually.
Here’s an example code snippet for parsing quoted values with embedded delimiters:
import csv
with open('data.csv', newline='', encoding='utf-8') as csvfile:
reader = csv.reader(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
for row in reader:
# remove quotes from values containing the delimiter
row = [value[1:-1] if value.startswith('"') and value.endswith('"') else value for value in row]
print(row)
Parsing values with multi-line text:
If a value in a CSV file contains line breaks, the value may span multiple lines. To parse such values correctly, we need to handle the line breaks properly. We can do this by using the csv.reader()
function with the lineterminator
argument, which specifies the character used to terminate lines, and then concatenating the lines manually.
import csv
with open('data.csv', newline='', encoding='utf-8') as csvfile:
reader = csv.reader(csvfile, delimiter=',', quotechar='"', lineterminator='\n')
for row in reader:
# join multi-line values
for i, value in enumerate(row):
if '\n' in value:
row[i] = value.replace('\n', '')
print(row)
In this code, we pass the lineterminator='\n'
argument to the csv.reader()
function to specify that line breaks are used to terminate rows. We then iterate over each row of the file and, for each value in the row, check if it contains a line break ('\n'
). If it does, we remove the line break from the value using the replace()
method.
Parsing values with custom quoting and escape characters:
If a CSV file uses custom quoting or escape characters (i.e., characters other than the default double-quote and backslash), we need to specify these characters explicitly. We can do this by using the csv.reader()
function with the quotechar
and escapechar
arguments, which specify the custom quoting and escape characters, respectively.
Here’s an example code snippet for parsing a CSV file with custom quoting and escape characters:
import csv
with open('data.csv', newline='', encoding='utf-8') as csvfile:
reader = csv.reader(csvfile, delimiter=',', quotechar="'", escapechar='\\')
for row in reader:
print(row)
In this code, we pass the quotechar="'"
and escapechar='\\'
arguments to the csv.reader()
function to specify the single-quote character as the quote character and the backslash character as the escape character.
Parsing values with a variable number of fields:
If a CSV file has a variable number of fields per row (e.g., some rows have additional columns), we need to handle this dynamically. We can do this by using the csv.reader()
function with the restkey
argument, which specifies the name of a key to store any additional fields that are not mapped to existing keys.
Here’s an example code snippet for parsing a CSV file with a variable number of fields:
import csv
with open('data.csv', newline='', encoding='utf-8') as csvfile:
reader = csv.reader(csvfile, delimiter=',', quotechar='"', restkey='extra')
for row in reader:
print(row, row[-1])
In this code, we pass the restkey='extra'
argument to the csv.reader()
function to specify the name of the key to store any additional fields. We then print each row and the last value of the extra
key, which contains any additional fields.
Reading and Writing Dictionary to CSV
DictReader
and DictWriter
are classes provided by the CSV module in Python that allow us to read and write CSV files as dictionaries instead of lists. Specifically, DictReader
reads each row of the CSV file as a dictionary where the keys are the column names (i.e., the header row), and DictWriter
writes dictionaries to the CSV file where the keys correspond to the column names.
The advantage of using DictReader
and DictWriter
is that it allows us to work with CSV data in a more intuitive and flexible way, as we can access the values in each row using the column names rather than relying on their position in a list. This makes the code easier to read and maintain, especially when working with large or complex CSV files with many columns.
Here’s an example code snippet for reading a CSV file using DictReader
:
import csv
with open('data.csv', newline='', encoding='utf-8') as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
print(row['Name'], row['Age'], row['Country'])
In this code, we open a CSV file called ‘data.csv’ and create a DictReader
object using the csv.DictReader()
function. We then iterate over each row of the file and print the values of the ‘Name’, ‘Age’, and ‘Country’ columns using the dictionary keys.
Here’s an example code snippet for writing a CSV file using DictWriter
:
import csv
header = ['Name', 'Age', 'Country']
data = [
{'Name': 'Alice', 'Age': 25, 'Country': 'USA'},
{'Name': 'Bob', 'Age': 30, 'Country': 'Canada'},
{'Name': 'Charlie', 'Age': 35, 'Country': 'Australia'}
]
with open('output.csv', mode='w', newline='', encoding='utf-8') as csvfile:
writer = csv.DictWriter(csvfile, fieldnames=header)
writer.writeheader()
for row in data:
writer.writerow(row)
In this code, we create a header row as a list called header
, and a list of dictionaries called data
, where each dictionary corresponds to a row of data and contains keys corresponding to the column names.
We then open a new file called ‘output.csv’ using the open()
function with the mode='w'
, newline=''
, and encoding='utf-8'
arguments. We then create a DictWriter
object using the csv.DictWriter()
function, passing in the file object and the header row as arguments.
We then write the header row to the file using the writeheader()
method of the writer object, and write the data using a for
loop and the writerow()
method.
One potential disadvantage of using DictReader
and DictWriter
is that they may be slightly slower and use more memory than their list-based counterparts, especially when working with very large CSV files. However, the performance difference is typically negligible for most use cases.
Another disadvantage is that DictReader
and DictWriter
maybe less flexible than the list-based alternatives when dealing with CSV files that have non-standard formats or do not have a header row.