Let’s see how to use the NumPy genfromtxt function.
numpy.genfromtxt is particularly powerful because of its flexibility in handling various text file formats, including those with missing values, different data types within columns, and delimited structures. Unlike simpler loading functions, genfromtxt offers robust options for customization and error handling during the data loading process, making it suitable for real-world messy datasets.
Using the genfromtxt function
The NumPy genfromtxt function loads the contents of a text file into a NumPy array. I’m using genfromtxt like this:
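    import os

    import numpy as np

    # Make relative paths resolve against the project folder
    os.chdir("C:/Users/Pythoneo/Documents/MyProjects")

    # Load a comma-separated file of floats into an array
    a = np.genfromtxt("data.csv", dtype='float', delimiter=',')
    print(a)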
os.chdir sets the working directory, so the relative path "data.csv" is resolved against that folder. Without the chdir call, genfromtxt looks for the file in the process's current working directory.
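If changing the working directory feels too global, one alternative is to pass a full path to genfromtxt. Here is a minimal sketch of that idea. The temporary folder and the sample values are made up for illustration and are not the data.csv used above.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical sample file, created only for this illustration
    csv_path = Path(tmp_dir) / "data.csv"
    csv_path.write_text("1.0,2.0,3.0\n4.0,5.0,6.0\n")

    # Pass the full path instead of relying on the working directory
    a = np.genfromtxt(csv_path, dtype=float, delimiter=",")

print(a)
```

Because the path is explicit, the result does not depend on where the script was started from.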
The genfromtxt function accepts many parameters.
Beyond dtype and delimiter, numpy.genfromtxt offers a rich set of parameters to fine-tune data loading. Some commonly used parameters include:
skip_header: To skip a specified number of lines at the beginning of the file, often used to ignore header rows in data files.
names: To assign names to the columns of the resulting array, either by reading them from the header row (if names=True and a header exists) or by providing a list of names.
missing_values and filling_values: To handle missing data by specifying what strings should be treated as missing and what values should be used to fill in these missing entries, respectively.
converters: To apply custom functions to specific columns during data loading, allowing for on-the-fly data transformation or cleaning.
    from io import StringIO

    import numpy as np

    data_with_header = """Name,Age,City
    Alice,25,New York
    Bob,30,London
    Charlie,28,Paris"""

    data_file = StringIO(data_with_header)

    # names=True reads the column names from the first line, so the header
    # must not be skipped here (skip_header stays at its default of 0).
    # dtype=None infers a type per column; encoding="utf-8" returns text as str.
    loaded_data = np.genfromtxt(data_file, delimiter=',', names=True,
                                dtype=None, encoding="utf-8")
    print(loaded_data)
    print(loaded_data.dtype.names)  # Print column names
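The missing_values, filling_values, and converters parameters from the list above can be sketched in the same way. The sample strings and the helper percent_to_fraction are made up for illustration, modeled on the examples in the NumPy user guide. Passing encoding="utf-8" is an assumption on my side so that the converter receives str rather than bytes on older NumPy versions.

```python
from io import StringIO

import numpy as np

# Missing data: "N/A" and "???" are declared as missing markers,
# and an empty field always counts as missing.
raw_missing = StringIO("N/A,2,3\n4,,???")
filled = np.genfromtxt(
    raw_missing,
    delimiter=",",
    dtype=int,
    names="a,b,c",
    missing_values={0: "N/A", 2: "???"},     # per-column markers
    filling_values={0: 0, "b": 0, 2: -999},  # per-column replacements
    encoding="utf-8",
)
print(filled)


def percent_to_fraction(text):
    """Hypothetical helper: turn a string like '2.3%' into 0.023."""
    return float(text.strip("%")) / 100.0


# Converters: column 1 holds percentages that float() cannot parse directly.
raw_percent = StringIO("1,2.3%,45.\n6,78.9%,0")
converted = np.genfromtxt(
    raw_percent,
    delimiter=",",
    names=("i", "p", "n"),
    converters={1: percent_to_fraction},
    encoding="utf-8",
)
print(converted["p"])
```

Keys in these dictionaries can be column indices or column names, which is handy when the column order may change.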
Using skip_header (and loadtxt's skiprows) to Skip Rows
The skip_header parameter controls how many lines at the beginning of the file are ignored during loading. It counts lines rather than detecting headers: skip_header=3 skips the first three lines, regardless of their content. By default, skip_header=0, so nothing is skipped. The similarly named skiprows parameter belongs to numpy.loadtxt. genfromtxt once accepted it as a deprecated alias, but it was removed in NumPy 1.10, so use skip_header here. Be careful when combining skip_header with names=True: the column names are read from the first line after the skipped ones. If your file has a single header row with column names, use names=True and leave skip_header at 0. Use skip_header=1 only when you pass your own list of names to names (or want no names at all), so that the header row is discarded instead of being parsed as data.
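To make the interaction between skip_header and names concrete, here is a small sketch. The file contents and column names are made up for illustration. It assumes a file with two preamble lines followed by one header row.

```python
from io import StringIO

import numpy as np

# Hypothetical file: two preamble lines, one header row, two data rows
text = (
    "Exported by a hypothetical logger\n"
    "Units: years and cm\n"
    "age,height\n"
    "25,170\n"
    "30,182\n"
)

# names=True: skip only the preamble; the next line supplies the names.
by_header = np.genfromtxt(StringIO(text), delimiter=",",
                          skip_header=2, names=True, encoding="utf-8")
print(by_header.dtype.names)

# Explicit names: skip the preamble and the header row as well.
by_list = np.genfromtxt(StringIO(text), delimiter=",",
                        skip_header=3, names=["years", "cm"],
                        encoding="utf-8")
print(by_list.dtype.names)

# numpy.loadtxt is the function that uses skiprows for the same purpose.
plain = np.loadtxt(StringIO(text), delimiter=",", skiprows=3)
print(plain)
```

In both genfromtxt calls the two data rows are loaded unchanged. Only the source of the column names differs.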