Working with the text files
- Working with f-strings for formatted printing
- Working with .CSV, .TSV files to read and write
- Working with %%writefile to create simple .txt files [works in jupyter notebook only]
- Working with Python's inbuilt file read and write
String Formatter
String formatting enables us to display the strings in a specified format. This helps us to improve the visual effect and also to process the strings later.
name = 'KGP Talkie'
The format() method formats the specified value(s) and inserts them into the string's placeholders. A placeholder is defined using curly brackets: {}.
print('The YouTube channel is {}'.format(name))
The YouTube channel is KGP Talkie
To create an f-string, prefix the string with the letter "f". The string itself can be formatted in much the same way that you would with str.format(). F-strings provide a concise and convenient way to embed Python expressions inside string literals for formatting.
print(f'The YouTube channel is {name}')
The YouTube channel is KGP Talkie
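Because the braces accept any Python expression, an f-string can also call methods, do arithmetic, or apply a format specifier. The following is a minimal sketch; the variables videos and price are hypothetical values introduced only for illustration.

```python
name = 'KGP Talkie'
videos = 3          # hypothetical value, for illustration only
price = 1234.5678   # hypothetical value, for illustration only

# Any expression can go inside the braces, including method calls and arithmetic.
upper_line = f'The YouTube channel is {name.upper()}'
count_line = f'{videos} videos now, {videos + 1} after the next upload'

# A format specifier follows a colon, exactly as in str.format().
price_line = f'{price:,.2f}'

print(upper_line)   # The YouTube channel is KGP TALKIE
print(count_line)   # 3 videos now, 4 after the next upload
print(price_line)   # 1,234.57
```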
Now we are going to see how to work with minimum width and alignment between the columns. Here we have created a list of tuples.
data_science_tuts = [('Python for Beginners', 19),
('Feature Selectiong for Machine Learning', 11),
('Machine Learning Tutorials', 11),
('Deep Learning Tutorials', 19)]
data_science_tuts
[('Python for Beginners', 19), ('Feature Selectiong for Machine Learning', 11), ('Machine Learning Tutorials', 11), ('Deep Learning Tutorials', 19)]
First we will print the contents of the list without any formatting or alignment.
for info in data_science_tuts:
    print(info)
('Python for Beginners', 19)
('Feature Selectiong for Machine Learning', 11)
('Machine Learning Tutorials', 11)
('Deep Learning Tutorials', 19)
Now we will print the same thing using proper alignment. Here info[0] represents the first value of the tuple and info[1] represents the second value. {50} and {10} set the minimum width of the first and second columns.
for info in data_science_tuts:
    print(f'{info[0]:{50}} {info[1]:{10}}')
Python for Beginners                                       19
Feature Selectiong for Machine Learning                    11
Machine Learning Tutorials                                 11
Deep Learning Tutorials                                    19
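The alignment option is written after the colon in the format specifier:
- :< forces the field to be left-aligned within the available space (this is the default for most objects).
- :> forces the field to be right-aligned within the available space (this is the default for numbers).
- :^ forces the field to be centered within the available space.
A character placed before the alignment option is used as the fill character. In the example below, . fills the unused width with dots.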
for info in data_science_tuts:
    print(f'{info[0]:{50}} {info[1]:.>{10}}')
Python for Beginners                               ........19
Feature Selectiong for Machine Learning            ........11
Machine Learning Tutorials                         ........11
Deep Learning Tutorials                            ........19
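The alignment options and the fill character are easier to compare side by side on one short value. This is a small sketch; the label variable and the width of 9 are arbitrary choices made for illustration, and the trailing | only marks where each field ends.

```python
label = 'NLP'  # arbitrary sample value

left = f'{label:<9}|'      # pad on the right
right = f'{label:>9}|'     # pad on the left
center = f'{label:^9}|'    # pad on both sides
filled = f'{label:*^9}|'   # '*' is the fill character, '^' the alignment
number = f'{42:.>9}|'      # '.' fill with right alignment, as in the loop above

for line in (left, right, center, filled, number):
    print(line)
```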
Working with .CSV or .TSV Files
Now we will see how to work with CSV (Comma-Separated Values) and TSV (Tab-Separated Values) files.
The first step is to read such files. We will use pandas to read the files.
import pandas as pd
read_csv() is an important pandas function to read CSV files. We can use it to read TSV files as well by setting sep = '\t' which means the separator is a tab. head() returns the first 5 rows of the dataframe.
data = pd.read_csv('moviereviews.tsv', sep = '\t')
data.head()
| | label | review |
|---|---|---|
| 0 | neg | how do films like mouse hunt get into theatres... |
| 1 | neg | some talented actresses are blessed with a dem... |
| 2 | pos | this has been an extraordinary year for austra... |
| 3 | pos | according to hollywood movies made in last few... |
| 4 | neg | my first press screening of 1998 and already i... |
The shape attribute of a pandas DataFrame stores the number of rows and columns as a tuple (number of rows, number of columns). The data read with read_csv() has 2000 rows and 2 columns.
data.shape
(2000, 2)
The value_counts() function returns a Series containing counts of unique values. The resulting object is in descending order, so the first element is the most frequently occurring one. We have called value_counts() on data['label'], which is the column named label. It has 1000 occurrences of neg and 1000 occurrences of pos.
data['label'].value_counts()
neg 1000
pos 1000
Name: label, dtype: int64
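The same calls can be tried without the dataset by reading a tab-separated string from memory. This is only a sketch: the three rows in tsv_text are a hypothetical stand-in for moviereviews.tsv, and io.StringIO is used so that read_csv() has a file-like object to read.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for moviereviews.tsv.
tsv_text = 'label\treview\nneg\tdull plot\npos\tgreat cast\npos\tloved it\n'

# sep='\t' tells read_csv() that columns are separated by tabs.
sample = pd.read_csv(io.StringIO(tsv_text), sep='\t')
counts = sample['label'].value_counts()

print(sample.shape)   # (3, 2)
print(counts)         # pos appears twice, neg once
```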
Now we are specifying a condition data['label']=='pos'. That means we will only get those rows which have pos in their label column.
pos = data[data['label']=='pos']
pos.head()
| | label | review |
|---|---|---|
| 2 | pos | this has been an extraordinary year for austra... |
| 3 | pos | according to hollywood movies made in last few... |
| 11 | pos | with stars like sigourney weaver ( " alien " t... |
| 16 | pos | i remember hearing about this film when it fir... |
| 18 | pos | garry shandling makes his long overdue starrin... |
The to_csv() method saves a pandas DataFrame as a CSV file. We have stored the DataFrame pos as a TSV file by setting sep = '\t'. We have set index = False because we do not want the index to be stored in the file.
pos.to_csv('pos.tsv', sep = '\t', index = False)
pd.read_csv('pos.tsv', sep = '\t').head()
| | label | review |
|---|---|---|
| 0 | pos | this has been an extraordinary year for austra... |
| 1 | pos | according to hollywood movies made in last few... |
| 2 | pos | with stars like sigourney weaver ( " alien " t... |
| 3 | pos | i remember hearing about this film when it fir... |
| 4 | pos | garry shandling makes his long overdue starrin... |
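Notice that the filtered DataFrame kept the original row labels (2, 3, 11, ...), while the reloaded file starts again at 0. The sketch below reproduces that round trip on a tiny, hypothetical DataFrame and writes to a temporary directory, so the file name pos_sample.tsv is an assumption and no real file is touched.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data; the tutorial itself uses moviereviews.tsv.
reviews = pd.DataFrame({
    'label': ['neg', 'pos', 'neg', 'pos'],
    'review': ['dull plot', 'great cast', 'too long', 'loved it'],
})

# Boolean filtering keeps the original row labels.
positive = reviews[reviews['label'] == 'pos']
print(list(positive.index))   # [1, 3]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'pos_sample.tsv')
    # index=False leaves the row labels out of the file.
    positive.to_csv(path, sep='\t', index=False)
    reloaded = pd.read_csv(path, sep='\t')

# Reading the file back assigns a fresh index starting at 0.
print(list(reloaded.index))   # [0, 1]
```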
Built-in magic command in Jupyter: %%writefile
%%writefile writes the contents of the cell to a file. Here the content will be written into text1.txt.
%%writefile text1.txt
Hello, this is the NLP lesson.
Please Like and Subscribe to show your support
Writing text1.txt

The -a flag appends the contents of the cell to an existing file. The file will be created if it does not exist.
%%writefile -a text1.txt
Thanks for watching
Appending to text1.txt
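Outside a notebook, roughly the same result can be had with plain Python. The sketch below is an approximation of the two cells above, not what Jupyter runs internally, and it writes into a temporary directory so that it does not overwrite an existing text1.txt.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'text1.txt')

    # Rough equivalent of: %%writefile text1.txt
    with open(path, 'w') as f:
        f.write('Hello, this is the NLP lesson.\n')
        f.write('Please Like and Subscribe to show your support\n')

    # Rough equivalent of: %%writefile -a text1.txt
    with open(path, 'a') as f:
        f.write('Thanks for watching\n')

    with open(path) as f:
        content = f.read()

print(content)
```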

Use Python's built-in functions to read and write text files
The open() function opens a file and returns it as a file object. There are various modes in which you can open the file. Some of the basic modes are:
- "r" - Read - Default value. Opens a file for reading, error if the file does not exist.
- "a" - Append - Opens a file for appending, creates the file if it does not exist.
- "w" - Write - Opens a file for writing, creates the file if it does not exist and overwrites its contents if it does.
- "x" - Create - Creates the specified file, returns an error if the file exists.
We have opened the file in the read mode.
file = open('text1.txt', 'r')
file
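To see how the modes listed earlier differ in practice, the following sketch exercises each of them on a throwaway file. The file name modes_demo.txt is hypothetical, and everything happens inside a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'modes_demo.txt')  # hypothetical file name

    # 'r' fails when the file is missing.
    try:
        open(path, 'r')
        read_missing = 'opened'
    except FileNotFoundError:
        read_missing = 'FileNotFoundError'

    # 'x' creates the file, and fails on a second attempt.
    with open(path, 'x') as f:
        f.write('first\n')
    try:
        open(path, 'x')
        create_again = 'opened'
    except FileExistsError:
        create_again = 'FileExistsError'

    # 'a' keeps the existing content and adds to the end.
    with open(path, 'a') as f:
        f.write('second\n')
    with open(path) as f:
        after_append = f.read()

    # 'w' discards the existing content before writing.
    with open(path, 'w') as f:
        f.write('third\n')
    with open(path) as f:
        after_write = f.read()

print(read_missing, create_again)
print(repr(after_append), repr(after_write))
```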
The read() method returns the specified number of characters from the file (or bytes, if the file was opened in binary mode). The default is -1, which means the whole file.
file.read()
'Hello, this is the NLP lesson.\nPlease Like and Subscribe to show your support\nThanks for watching\n'
If we read the same file again we will get an empty string. This is because the file pointer has reached the end of the file.
file.read()
seek() sets the file's current position at the offset specified. We have specified the offset as 0. Hence the file pointer will be set at the start of the file.
file.seek(0)
0
Now if we read the file we will not get an empty string.
file.read()
'Hello, this is the NLP lesson.\nPlease Like and Subscribe to show your support\nThanks for watching\n'
file.seek(0)
0
readline() reads one entire line from the file. If we call it the second time it will read the second line.
file.readline()
'Hello, this is the NLP lesson.\n'
file.seek(0)
0
readlines() reads until EOF (End Of File) using readline() and returns a list containing the lines.
file.readlines()
['Hello, this is the NLP lesson.\n', 'Please Like and Subscribe to show your support\n', 'Thanks for watching\n']
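All of these read methods share one file pointer, so each call continues where the previous one stopped. The sketch below shows this on a two-line file; pointer_demo.txt is a hypothetical name and the file lives in a temporary directory.

```python
import os
import tempfile

text = 'Hello, this is the NLP lesson.\nThanks for watching\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'pointer_demo.txt')  # hypothetical file name
    with open(path, 'w') as f:
        f.write(text)

    with open(path, 'r') as f:
        first_five = f.read(5)           # reads only 5 characters
        rest_of_line = f.readline()      # continues from the pointer to the end of the line
        remaining = f.readlines()        # every line that is left
        at_end = f.read()                # nothing left, so an empty string
        f.seek(0)                        # move the pointer back to the start
        first_line_again = f.readline()

print(repr(first_five))        # 'Hello'
print(repr(rest_of_line))      # ', this is the NLP lesson.\n'
print(remaining)               # ['Thanks for watching\n']
print(repr(at_end))            # ''
print(repr(first_line_again))  # 'Hello, this is the NLP lesson.\n'
```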
It is a good practice to use the close() method to close a file after performing all the operations. After you close a file you cannot perform any operations on it but the file object is still available.
file.close()
file
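Whether a file object has been closed can be checked through its closed attribute, and reading from a closed file raises a ValueError. A minimal sketch, using a hypothetical file closed_demo.txt in a temporary directory:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'closed_demo.txt')  # hypothetical file name
    with open(path, 'w') as f:
        f.write('one line\n')

    handle = open(path)
    open_before = handle.closed    # False while the file is open
    handle.close()
    closed_after = handle.closed   # True once close() has been called

    # The object still exists, but it can no longer be read.
    try:
        handle.read()
        error_message = None
    except ValueError as exc:
        error_message = str(exc)

print(open_before, closed_after)
print(error_message)
```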
If we do not want to explicitly close the file we can read the file in the following way.
with open('text1.txt') as file:
    text_data = file.readlines()
    print(text_data)
['Hello, this is the NLP lesson.\n', 'Please Like and Subscribe to show your support\n', 'Thanks for watching\n']
With no arguments, strip() returns a copy of the string with leading and trailing whitespace removed, including the newline character at the end of each line.
for temp in text_data:
    print(temp.strip())
Hello, this is the NLP lesson.
Please Like and Subscribe to show your support
Thanks for watching
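strip() has one-sided variants, and all three accept an optional set of characters to remove. The following sketch compares them on a single string; the raw value is made up for illustration.

```python
raw = '  Thanks for watching.\n'  # made-up sample with extra whitespace

both_sides = raw.strip()             # whitespace removed from both ends
right_only = raw.rstrip('\n')        # only the trailing newline removed
left_only = raw.lstrip()             # only the leading spaces removed
no_period = raw.strip().strip('.')   # then the given character removed from the ends

print(repr(both_sides))   # 'Thanks for watching.'
print(repr(right_only))   # '  Thanks for watching.'
print(repr(left_only))    # 'Thanks for watching.\n'
print(repr(no_period))    # 'Thanks for watching'
```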
The enumerate() function adds a counter to an iterable and returns it as an enumerate object. This object can be used directly in for loops or converted into a list of tuples using list().
for i, temp in enumerate(text_data):
    print(str(i) + " ---> " + temp.strip())
0 ---> Hello, this is the NLP lesson.
1 ---> Please Like and Subscribe to show your support
2 ---> Thanks for watching
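enumerate() also takes a start argument, which is handy when line numbers should begin at 1. The sketch below repeats the text_data list so that it runs on its own, and also shows the list-of-tuples form mentioned above.

```python
text_data = ['Hello, this is the NLP lesson.\n',
             'Please Like and Subscribe to show your support\n',
             'Thanks for watching\n']

# start=1 makes the counter begin at 1 instead of 0.
numbered = [f'{i} ---> {line.strip()}' for i, line in enumerate(text_data, start=1)]

# list() turns the enumerate object into (index, item) tuples.
pairs = list(enumerate(text_data))

for entry in numbered:
    print(entry)
print(pairs[2])   # (2, 'Thanks for watching\n')
```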
Now we will see how to write a file. For that we will open a file in the write(w) mode.
file = open('text2.txt', 'w')
file
The write() method writes a specified text to the file. It returns the number of characters written.
file.write('This is just another lesson')
27
If you look at text2.txt right now, it will most likely be an empty file. This is because writes are buffered in memory, and closing the file flushes the buffer and completes the write operation.
file.close()
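If the text needs to reach the file before it is closed, flush() can be called on the open file object. This is a sketch under the assumption that default buffering is in use; buffer_demo.txt is a hypothetical file in a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'buffer_demo.txt')  # hypothetical file name

    writer = open(path, 'w')
    written = writer.write('This is just another lesson')

    # flush() pushes the buffered text to the file without closing it.
    writer.flush()
    with open(path) as reader:
        after_flush = reader.read()

    writer.close()

print(written)       # 27
print(after_flush)   # This is just another lesson
```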

An alternative way to write a file is given below. In this case the file does not need to be closed explicitly, because the with block closes it automatically.
with open('text3.txt', 'w') as file:
    file.write('This is third file \n')

text_data
['Hello, this is the NLP lesson.\n', 'Please Like and Subscribe to show your support\n', 'Thanks for watching\n']
Now we will open text3.txt in append mode and append the content of text_data to it.
with open('text3.txt', 'a') as file:
    for temp in text_data:
        file.write(temp)
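The loop can also be replaced by a single writelines() call, which writes each string in a list exactly as it is and does not add newlines of its own. The sketch below is an alternative to the loop above, not a change to it; it writes to a temporary directory and repeats text_data so that it is self-contained.

```python
import os
import tempfile

text_data = ['Hello, this is the NLP lesson.\n',
             'Please Like and Subscribe to show your support\n',
             'Thanks for watching\n']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'text3.txt')

    with open(path, 'w') as f:
        f.write('This is third file \n')

    # Append every string in the list with one call.
    with open(path, 'a') as f:
        f.writelines(text_data)

    with open(path) as f:
        lines = f.readlines()

print(lines)
```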

Conclusion
In this tutorial you learned Python's core file I/O patterns for NLP data preparation — from f-string formatting for readable output, to reading and writing CSV/TSV datasets with pandas, to direct file operations using Python's built-in open() for plain text.
Key takeaways:
- f-strings (f'{name}') are the modern Python standard for string interpolation and a concise alternative to str.format(). Use the :{width} and :.>{width} format specifiers for column alignment in reports.
- pandas.read_csv(path, sep='\t') reads TSV files, and the default separator handles CSV files; to_csv(..., index=False) exports without the index column.
- open(path, 'r') returns a file handle. Call file.seek(0) before re-reading from the same handle, and prefer a with block, which closes the handle automatically.
- %%writefile is a Jupyter-only magic command; for production pipelines, use the with open(path, 'w') as f: f.write(...) pattern, which works everywhere.
Next steps:
- Extend these file I/O skills to PDF documents in Extract Text from PDF Files in Python for NLP.
- Apply the TSV reading workflow to load the spam dataset used in Spam Text Message Classification with NLP.
- Combine file I/O with spaCy's nlp.pipe() for batch processing of large text corpora as shown in Processing Pipeline in spaCy.
