Working with Text Files in Python for NLP

Working with the text files

Working with f-strings for formatted printing
Working with .CSV, .TSV files to read and write
Working with %%writefile to create simple .txt files [works in jupyter notebook only]
Working with Python's inbuilt file read and write

String Formatter

String formatting lets us display strings in a specified format. This makes printed output easier to read and keeps strings in a consistent shape for later processing.

PYTHON

name = 'KGP Talkie'

The format() method formats the specified value(s) and inserts them into the string's placeholders. A placeholder is defined using curly brackets: {}.

PYTHON

print('The YouTube channel is {}'.format(name))

OUTPUT

The YouTube channel is KGP Talkie

To create an f-string, prefix the string with the letter "f". The string itself can be formatted in much the same way as with str.format(). F-strings provide a concise and convenient way to embed Python expressions inside string literals for formatting.

PYTHON

print(f'The YouTube channel is {name}')

OUTPUT

The YouTube channel is KGP Talkie
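To make the comparison concrete, here is a minimal sketch that builds the same sentence with positional placeholders, named placeholders, and an f-string containing expressions. The variable lessons and the sentence wording are made up for illustration.

```python
# Illustrative values; `lessons` is a made-up number, not from the tutorial.
name = 'KGP Talkie'
lessons = 4

# str.format(): placeholders are filled by position or by name.
positional = 'The channel {} has {} lessons'.format(name, lessons)
named = 'The channel {n} has {k} lessons'.format(n=name, k=lessons)

# f-string: any Python expression can appear inside the braces.
inline = f'The channel {name.upper()} has {lessons * 2} lessons'

print(positional)
print(named)
print(inline)
```

The f-string version keeps each value next to the place where it is printed, which is usually easier to read once a string has more than one or two placeholders.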

Now we are going to see how to work with minimum width and alignment between the columns. Here we have created a list of tuples.

PYTHON

data_science_tuts = [('Python for Beginners', 19),
                    ('Feature Selectiong for Machine Learning', 11),
                    ('Machine Learning Tutorials', 11),
                    ('Deep Learning Tutorials', 19)]
data_science_tuts

OUTPUT

[('Python for Beginners', 19), ('Feature Selectiong for Machine Learning', 11), ('Machine Learning Tutorials', 11), ('Deep Learning Tutorials', 19)]

First we will print the contents of the list without any formatting or alignment.

PYTHON

for info in data_science_tuts:
    print(info)

OUTPUT

('Python for Beginners', 19)
('Feature Selectiong for Machine Learning', 11)
('Machine Learning Tutorials', 11)
('Deep Learning Tutorials', 19)

Now we will print the same thing using proper alignment. Here info[0] represents the first value of the tuple and info[1] represents the second value. {50} and {10} set the minimum width of each column: the title is padded to 50 characters and the number to 10.

PYTHON

for info in data_science_tuts:
    print(f'{info[0]:{50}} {info[1]:{10}}')

OUTPUT

Python for Beginners                                       19
Feature Selectiong for Machine Learning                    11
Machine Learning Tutorials                                 11
Deep Learning Tutorials                                    19

:< Forces the field to be left-aligned within the available space (this is the default for most objects).
:> Forces the field to be right-aligned within the available space (this is the default for numbers).
:^ Forces the field to be centered within the available space.
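The three alignment options above can be compared side by side in a small sketch. The word 'NLP' and the width of 9 are arbitrary choices for illustration; the square brackets only make the padding visible.

```python
title = 'NLP'

left = f'[{title:<9}]'     # text padded on the right
right = f'[{title:>9}]'    # text padded on the left
center = f'[{title:^9}]'   # padding split on both sides
number = f'[{19:9}]'       # no alignment given: numbers right-align by default

for line in (left, right, center, number):
    print(line)
```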

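A character placed before the alignment symbol is used as the fill character. In .> the dot fills the empty space to the left of the right-aligned value, as you can see below.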

PYTHON

for info in data_science_tuts:
    print(f'{info[0]:{50}} {info[1]:.>{10}}')

OUTPUT

Python for Beginners                               ........19
Feature Selectiong for Machine Learning            ........11
Machine Learning Tutorials                         ........11
Deep Learning Tutorials                            ........19
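A fixed width such as 50 works when you know the data in advance. As a hedged variation, the width can also be computed from the longest title, so the columns always fit. The names rows, title_width and lines below are illustrative and not part of the original lesson.

```python
# A small sample in the same (title, count) shape as data_science_tuts.
rows = [('Python for Beginners', 19),
        ('Machine Learning Tutorials', 11),
        ('Deep Learning Tutorials', 19)]

# Longest title plus two characters of breathing room.
title_width = max(len(title) for title, _ in rows) + 2

# '.<' left-aligns and fills with dots; '.>' right-aligns and fills with dots.
lines = [f'{title:.<{title_width}}{count:.>6}' for title, count in rows]

for line in lines:
    print(line)
```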

Working with .CSV or .TSV Files

Now we will see how to work with CSV (Comma Separated Values) and TSV (Tab Separated Values) files.

The first step is to read such files. We will use pandas to read the files.

PYTHON

import pandas as pd

read_csv() is an important pandas function to read CSV files. We can use it to read TSV files as well by setting sep = '\t' which means the separator is a tab. head() returns the first 5 rows of the dataframe.

PYTHON

data = pd.read_csv('moviereviews.tsv', sep = '\t')
data.head()

OUTPUT

	label	review
0	neg	how do films like mouse hunt get into theatres...
1	neg	some talented actresses are blessed with a dem...
2	pos	this has been an extraordinary year for austra...
3	pos	according to hollywood movies made in last few...
4	neg	my first press screening of 1998 and already i...
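If pandas is not available, the standard library csv module can read the same tab-separated layout. This is a different tool from the one used in this lesson, shown only as a minimal sketch. The two reviews below are made-up sample rows, not lines from moviereviews.tsv.

```python
import csv
import io

# Made-up sample in the same label/review layout as the lesson's TSV file.
sample_tsv = ('label\treview\n'
              'neg\tslow and predictable\n'
              'pos\ta sharp, funny script\n')

# DictReader takes keys from the header row; delimiter='\t' makes it read TSV.
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter='\t')
rows = list(reader)

for row in rows:
    print(row['label'], '->', row['review'])
```

To read a real file, pass a file object from open() to csv.DictReader in place of the io.StringIO wrapper.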

The shape attribute of a pandas DataFrame stores the number of rows and columns as a tuple (number of rows, number of columns). The data read with read_csv() has 2000 rows and 2 columns.

PYTHON

data.shape

OUTPUT

(2000, 2)

The value_counts() function returns a Series containing counts of unique values. The result is sorted in descending order, so the first element is the most frequently occurring value. We have called value_counts() on data['label'], which is the column named label. It has 1000 occurrences of neg and 1000 occurrences of pos.

PYTHON

data['label'].value_counts()

OUTPUT

neg    1000
pos    1000
Name: label, dtype: int64

Now we are specifying a condition data['label']=='pos'. That means we will only get those rows which have pos in their label column.

PYTHON

pos = data[data['label']=='pos']
pos.head()

OUTPUT

	label	review
2	pos	this has been an extraordinary year for austra...
3	pos	according to hollywood movies made in last few...
11	pos	with stars like sigourney weaver ( " alien " t...
16	pos	i remember hearing about this film when it fir...
18	pos	garry shandling makes his long overdue starrin...

The to_csv() method saves a pandas DataFrame as a CSV file. Here the DataFrame pos is stored as a TSV file because we have set sep = '\t'. We have set index = False because we do not want the index to be stored in the file.

PYTHON

pos.to_csv('pos.tsv', sep = '\t', index = False)
pd.read_csv('pos.tsv', sep = '\t').head()

OUTPUT

	label	review
0	pos	this has been an extraordinary year for austra...
1	pos	according to hollywood movies made in last few...
2	pos	with stars like sigourney weaver ( " alien " t...
3	pos	i remember hearing about this film when it fir...
4	pos	garry shandling makes his long overdue starrin...
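The whole count, filter, save and reload sequence can be tried without any file on disk. The sketch below assumes pandas is installed and uses a tiny made-up DataFrame, with an in-memory io.StringIO buffer standing in for pos.tsv.

```python
import io

import pandas as pd

# Tiny made-up dataset in the same label/review layout.
df = pd.DataFrame({'label': ['neg', 'pos', 'pos'],
                   'review': ['too long', 'great cast', 'loved it']})

counts = df['label'].value_counts()   # how often each label occurs
pos_rows = df[df['label'] == 'pos']   # keep only the rows labelled pos

# Write as TSV to an in-memory buffer instead of a real file.
buffer = io.StringIO()
pos_rows.to_csv(buffer, sep='\t', index=False)

# Rewind the buffer and read it back, as we did with pos.tsv.
buffer.seek(0)
restored = pd.read_csv(buffer, sep='\t')
print(restored)
```

Because index=False was used when writing, the reloaded rows are numbered from 0 again rather than keeping their original positions.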

Built in magic command in jupyter `%%writefile`

%%writefile writes the contents of the cell to a file. Here the content will be written into text1.txt.

PYTHON

%%writefile text1.txt
Hello, this is the NLP lesson.
Please Like and Subscribe to show your support

OUTPUT

Writing text1.txt

Screenshot of text1.txt file contents showing the NLP lesson text written with %%writefile magic command

The -a flag appends the contents of the cell to an existing file. The file will be created if it does not exist.

PYTHON

%%writefile -a text1.txt
Thanks for watching

OUTPUT

Appending to text1.txt

Screenshot of text1.txt after appending a third line using %%writefile -a flag
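Outside a Jupyter notebook the same two steps can be done with plain Python. This is a minimal sketch, not what %%writefile does internally; it writes into a temporary directory (an assumption made here so that no real file is touched).

```python
import os
import tempfile

# A temporary directory keeps the sketch from touching real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'text1.txt')

    with open(path, 'w') as f:   # roughly what %%writefile text1.txt does
        f.write('Hello, this is the NLP lesson.\n')

    with open(path, 'a') as f:   # roughly what %%writefile -a text1.txt does
        f.write('Thanks for watching\n')

    with open(path) as f:
        content = f.read()

print(content)
```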

Use Python's built-in functions to read and write text files

The open() function opens a file and returns it as a file object. There are various modes in which you can open the file. Some of the basic modes are:

"r" - Read - Default value. Opens a file for reading, raises an error if the file does not exist
"a" - Append - Opens a file for appending, creates the file if it does not exist
"w" - Write - Opens a file for writing, creates the file if it does not exist and erases its contents if it does
"x" - Create - Creates the specified file, raises an error if the file already exists

We have opened the file in the read mode.

PYTHON

file = open('text1.txt', 'r')
file
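The behaviour of the four modes listed above can be checked with a short sketch. It works in a temporary directory, and the file names modes.txt and missing.txt are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'modes.txt')

    # 'x' creates the file, but only if it does not exist yet.
    with open(path, 'x') as f:
        f.write('first\n')
    try:
        open(path, 'x')
        x_failed = False
    except FileExistsError:
        x_failed = True

    # 'a' keeps the existing content and adds to the end.
    with open(path, 'a') as f:
        f.write('second\n')
    with open(path) as f:
        after_append = f.read()

    # 'w' erases the existing content before writing.
    with open(path, 'w') as f:
        f.write('third\n')
    with open(path) as f:
        after_write = f.read()

    # 'r' fails if the file is missing.
    try:
        open(os.path.join(tmp_dir, 'missing.txt'), 'r')
        r_failed = False
    except FileNotFoundError:
        r_failed = True

print(x_failed, repr(after_append), repr(after_write), r_failed)
```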

The read() method returns up to the specified number of characters from the file (bytes, if the file is opened in binary mode). The default is -1, which means the whole file.

PYTHON

file.read()

OUTPUT

'Hello, this is the NLP lesson.\nPlease Like and Subscribe to show your support\nThanks for watching\n'

If we read the same file again we will get an empty string. This is because the file pointer has reached the end of the file.

PYTHON

file.read()

seek() sets the file's current position at the offset specified. We have specified the offset as 0. Hence the file pointer will be set at the start of the file.

PYTHON

file.seek(0)

OUTPUT

0

Now if we read the file we will not get an empty string.

PYTHON

file.read()

OUTPUT

'Hello, this is the NLP lesson.\nPlease Like and Subscribe to show your support\nThanks for watching\n'
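The way the file pointer moves can be seen in one place with a small sketch. It creates its own two-line file in a temporary directory; the file name and its contents are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'pointer.txt')
    with open(path, 'w') as f:
        f.write('Hello\nNLP\n')

    f = open(path, 'r')
    first = f.read(5)      # reads 5 characters and stops there
    rest = f.read()        # continues from where the last read ended
    at_end = f.read()      # nothing left, so this is an empty string
    position = f.seek(0)   # move back to the start; returns the new position
    again = f.read()       # the whole file is available again
    f.close()

print(repr(first), repr(rest), repr(at_end), position, repr(again))
```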

PYTHON

file.seek(0)

OUTPUT

0

readline() reads one entire line from the file, including the trailing newline. If we call it a second time it will read the second line.

PYTHON

file.readline()

OUTPUT

'Hello, this is the NLP lesson.\n'

PYTHON

file.seek(0)

OUTPUT

0

readlines() reads until EOF (End Of File) and returns a list containing the lines.

PYTHON

file.readlines()

OUTPUT

['Hello, this is the NLP lesson.\n', 'Please Like and Subscribe to show your support\n', 'Thanks for watching\n']
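readlines() loads every line into memory at once. For a large corpus it is often enough to loop over the file object itself, which yields one line at a time. The sketch below counts lines and words that way; the sample file is created in a temporary directory and its name is an assumption made for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'corpus.txt')
    with open(path, 'w') as f:
        f.write('Hello, this is the NLP lesson.\nThanks for watching\n')

    line_count = 0
    word_count = 0
    f = open(path, 'r')
    for line in f:                      # one line per iteration
        line_count += 1
        word_count += len(line.split())  # split() breaks on whitespace
    f.close()

print(line_count, word_count)
```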

It is a good practice to use the close() method to close a file after performing all the operations. After you close a file you cannot perform any operations on it but the file object is still available.

PYTHON

file.close()
file

If we do not want to explicitly close the file we can read the file in the following way.

PYTHON

with open('text1.txt') as file:
    text_data = file.readlines()
    print(text_data)

OUTPUT

['Hello, this is the NLP lesson.\n', 'Please Like and Subscribe to show your support\n', 'Thanks for watching\n']
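A quick way to confirm that the with block really closes the file is to inspect the closed attribute afterwards. This sketch uses a made-up file in a temporary directory and also shows that reading from a closed file raises ValueError.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'text1.txt')
    with open(path, 'w') as f:
        f.write('Hello, this is the NLP lesson.\n')

    with open(path) as f:
        lines = f.readlines()

    # The name f still exists after the block, but the file is closed.
    closed_after_with = f.closed

    try:
        f.read()
        read_failed = False
    except ValueError:   # raised for I/O on a closed file
        read_failed = True

print(lines, closed_after_with, read_failed)
```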

strip() returns a copy of the string with leading and trailing whitespace removed, including the newline at the end of each line.

PYTHON

for temp in text_data:
    print(temp.strip())

OUTPUT

Hello, this is the NLP lesson.
Please Like and Subscribe to show your support
Thanks for watching
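strip() has two close relatives that are useful when only one side should be cleaned. The sketch below compares them on a made-up string with extra spaces.

```python
raw = '  Thanks for watching \n'

both = raw.strip()                 # whitespace removed from both ends
right_only = raw.rstrip()          # whitespace removed from the right end only
newline_only = raw.rstrip('\n')    # only the trailing newline removed
custom = '...NLP!!!'.strip('.!')   # strip the given characters instead

print(repr(both))
print(repr(right_only))
print(repr(newline_only))
print(repr(custom))
```

Using rstrip('\n') is the safer choice when leading spaces or indentation in the text carry meaning.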

The enumerate() function adds a counter to an iterable and returns it as an enumerate object. This object can be used directly in for loops or converted into a list of tuples using list().

PYTHON

for i, temp in enumerate(text_data):
    print(str(i) + "  --->  " + temp.strip())

OUTPUT

0  --->  Hello, this is the NLP lesson.
1  --->  Please Like and Subscribe to show your support
2  --->  Thanks for watching
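enumerate() also accepts a start argument, which helps when line numbers should begin at 1. This sketch combines it with an f-string instead of string concatenation; the two-line text_data list is a shortened copy used for illustration.

```python
text_data = ['Hello, this is the NLP lesson.\n',
             'Thanks for watching\n']

# start=1 makes the counter begin at 1; :>2 right-aligns the number.
numbered = [f'{i:>2}  --->  {line.strip()}'
            for i, line in enumerate(text_data, start=1)]

for entry in numbered:
    print(entry)
```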

Now we will see how to write a file. For that we will open a file in the write(w) mode.

PYTHON

file = open('text2.txt', 'w')
file

The write() method writes a specified text to the file. It returns the number of characters written.

PYTHON

file.write('This is just another lesson')

OUTPUT

27

If you open text2.txt right now it will probably still be empty. Python buffers writes, so a small write may not reach the disk until the file is flushed or closed.

PYTHON

file.close()

Screenshot of text2.txt file created with Python open() in write mode containing 27 characters
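Closing is not the only way to push buffered text to disk: flush() does it while keeping the file open. The sketch below is a minimal illustration in a temporary directory. It reads the file through a second handle, named reader here for illustration, after flushing.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'text2.txt')

    f = open(path, 'w')
    written = f.write('This is just another lesson')  # number of characters

    f.flush()   # push the buffered text to the file without closing it
    with open(path) as reader:
        visible_after_flush = reader.read()

    f.close()   # still close the file when finished

print(written, repr(visible_after_flush))
```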

An alternative way to write a file is given below. The with block closes the file automatically, so an explicit close() is not required.

PYTHON

with open('text3.txt', 'w') as file:
    file.write('This is third file \n')

Screenshot of text3.txt file written using Python context manager with open() showing single line

PYTHON

text_data

OUTPUT

['Hello, this is the NLP lesson.\n', 'Please Like and Subscribe to show your support\n', 'Thanks for watching\n']

Now we will open text3.txt in append mode and append the content of text_data to it.

PYTHON

with open('text3.txt', 'a') as file:
    for temp in text_data:
        file.write(temp)

Screenshot of text3.txt after appending text_data lines using open() in append mode
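When the lines are already in a list, writelines() can replace the loop. It writes each string as it is and does not add newlines, so the strings must already end with '\n'. The sketch below repeats the text3.txt steps in a temporary directory.

```python
import os
import tempfile

text_data = ['Hello, this is the NLP lesson.\n',
             'Thanks for watching\n']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'text3.txt')

    with open(path, 'w') as f:
        f.write('This is third file \n')

    with open(path, 'a') as f:
        f.writelines(text_data)   # same effect as writing each line in a loop

    with open(path) as f:
        result = f.readlines()

print(result)
```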

Conclusion

In this tutorial you learned Python's core file I/O patterns for NLP data preparation — from f-string formatting for readable output, to reading and writing CSV/TSV datasets with pandas, to direct file operations using Python's built-in open() for plain text.

Key takeaways:

f-strings (f'{name}') are the concise, modern way to interpolate values and in most cases replace str.format(); use width specifiers such as :{50} and fill-and-align specifiers such as :.>{10} for column alignment in reports.
pandas.read_csv(path, sep='\t') reads TSV files (the default separator, a comma, reads CSV); to_csv(..., index=False) exports without the index column.
open(path, 'r') returns a file handle; call file.seek(0) before re-reading from the same handle, and prefer a with block, which closes the handle automatically when the block ends.
%%writefile is a Jupyter-only magic command; for production pipelines, use the with open(path, 'w') as f: f.write(...) pattern which works everywhere.

Next steps:

Extend these file I/O skills to PDF documents in Extract Text from PDF Files in Python for NLP.
Apply the TSV reading workflow to load the spam dataset used in Spam Text Message Classification with NLP.
Combine file I/O with spaCy's nlp.pipe() for batch processing of large text corpora as shown in Processing Pipeline in spaCy.
