Working with Text Files in Python for NLP

Published by georgiannacambel on

Working with the text files

  • Working with f-strings for formatted printing
  • Working with .CSV, .TSV files to read and write
  • Working with %%writefile to create simple .txt files [works in jupyter notebook only]
  • Working with Python's inbuilt file read and write

String Formatter

String formatting enables us to display the strings in a specified format. This helps us to improve the visual effect and also to process the strings later.

name = 'KGP Talkie'

The format() method formats the specified value(s) and inserts them into the string's placeholders. A placeholder is defined using curly brackets: {}.

print('The YouTube channel is {}'.format(name))
The YouTube channel is KGP Talkie

To create an f-string, prefix the string with the letter "f". The string itself can be formatted in much the same way as with str.format(). F-strings provide a concise and convenient way to embed Python expressions inside string literals for formatting.

print(f'The YouTube channel is {name}')
The YouTube channel is KGP Talkie
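
As a small sketch of what "embedding Python expressions" means, the braces of an f-string can hold any expression, not only a variable name. The variable videos below is a made-up value used purely for illustration.

```python
name = 'KGP Talkie'
videos = 4  # hypothetical value, not taken from the lesson

# Any expression can go inside the braces: method calls, arithmetic, etc.
upper_line = f'The YouTube channel is {name.upper()}'
count_line = f'{name} has {videos + 1} tutorials after the next upload'

print(upper_line)
print(count_line)
```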

Now we are going to see how to work with minimum width and alignment between the columns. Here we have created a list of tuples.

data_science_tuts = [('Python for Beginners', 19),
                    ('Feature Selectiong for Machine Learning', 11),
                    ('Machine Learning Tutorials', 11),
                    ('Deep Learning Tutorials', 19)]
data_science_tuts
[('Python for Beginners', 19),
 ('Feature Selectiong for Machine Learning', 11),
 ('Machine Learning Tutorials', 11),
 ('Deep Learning Tutorials', 19)]

First we will print the contents of the list without any formatting or alignment.

for info in data_science_tuts:
    print(info)
('Python for Beginners', 19)
('Feature Selectiong for Machine Learning', 11)
('Machine Learning Tutorials', 11)
('Deep Learning Tutorials', 19)

Now we will print the same thing using proper alignment. Here info[0] represents the first value of the tuple and info[1] represents the second value. {50} and {10} set the minimum width of each column: the title is padded to at least 50 characters and the number to at least 10.

for info in data_science_tuts:
    print(f'{info[0]:{50}} {info[1]:{10}}')
Python for Beginners                                       19
Feature Selectiong for Machine Learning                    11
Machine Learning Tutorials                                 11
Deep Learning Tutorials                                    19
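
The alignment within a field can be set explicitly with one of the following options:

  • :< Forces the field to be left-aligned within the available space (this is the default for most objects).
  • :> Forces the field to be right-aligned within the available space (this is the default for numbers).
  • :^ Forces the field to be centered within the available space.

A character placed before the alignment option is used as the fill character. In the example below, . fills the unused width of the second column with dots.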

for info in data_science_tuts:
    print(f'{info[0]:<{50}} {info[1]:.>{10}}')
Python for Beginners                               ........19
Feature Selectiong for Machine Learning            ........11
Machine Learning Tutorials                         ........11
Deep Learning Tutorials                            ........19
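
The three alignment options and the fill character can be tried on a single value. The following is a minimal sketch; the widths 30 and 6 and the trailing | marker are arbitrary choices that make the padding visible.

```python
title, count = 'Python for Beginners', 19

left = f'{title:<30}|'        # pad on the right up to 30 characters
right = f'{count:>6}|'        # pad on the left up to 6 characters
centered = f'{title:*^30}|'   # centre the text and fill both sides with '*'

print(left)
print(right)
print(centered)
```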

Working with .CSV or .TSV Files

Now we will see how to work with CSV (Comma Separated Values) and TSV (Tab Separated Values) files.

The first step is to read such files. We will use pandas to read the files.

import pandas as pd

read_csv() is the pandas function for reading CSV files. We can use it to read TSV files as well by setting sep = '\t', which means the separator is a tab. head() returns the first 5 rows of the dataframe.

data = pd.read_csv('moviereviews.tsv', sep = '\t')
data.head()
  label                                             review
0   neg  how do films like mouse hunt get into theatres...
1   neg  some talented actresses are blessed with a dem...
2   pos  this has been an extraordinary year for austra...
3   pos  according to hollywood movies made in last few...
4   neg  my first press screening of 1998 and already i...

The shape attribute of a pandas DataFrame stores the number of rows and columns as a tuple (number of rows, number of columns). The data read using read_csv() has 2000 rows and 2 columns.

data.shape
(2000, 2)

The value_counts() function returns a Series containing the counts of unique values. The result is sorted in descending order, so the first element is the most frequently occurring one. We have called value_counts() on data['label'], which is the column named label. It has 1000 occurrences of neg and 1000 occurrences of pos.

data['label'].value_counts()
neg    1000
pos    1000
Name: label, dtype: int64

Now we are specifying a condition data['label']=='pos'. That means we will only get those rows which have pos in their label column.

pos = data[data['label']=='pos']
pos.head()
   label                                             review
2    pos  this has been an extraordinary year for austra...
3    pos  according to hollywood movies made in last few...
11   pos  with stars like sigourney weaver ( " alien " t...
16   pos  i remember hearing about this film when it fir...
18   pos  garry shandling makes his long overdue starrin...

The to_csv() method saves a pandas DataFrame as a CSV file. Here the dataframe pos is stored as a TSV file because we have set sep = '\t'. We have set index = False because we do not want the index to be written to the file.

pos.to_csv('pos.tsv', sep = '\t', index = False)
pd.read_csv('pos.tsv', sep = '\t').head()
  label                                             review
0   pos  this has been an extraordinary year for austra...
1   pos  according to hollywood movies made in last few...
2   pos  with stars like sigourney weaver ( " alien " t...
3   pos  i remember hearing about this film when it fir...
4   pos  garry shandling makes his long overdue starrin...
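
The same write, read and filter round trip can also be sketched without pandas, using Python's standard csv module. This is an alternative technique, not the approach used in this lesson. The two rows and the file name sample.tsv are made up for illustration and are not taken from moviereviews.tsv.

```python
import csv
import os
import tempfile

# Two made-up rows standing in for the real review data.
rows = [
    {'label': 'neg', 'review': 'a dull, predictable film'},
    {'label': 'pos', 'review': 'a warm, funny surprise'},
]

# Write into a temporary directory so no existing file is touched.
path = os.path.join(tempfile.mkdtemp(), 'sample.tsv')

# newline='' lets the csv module control line endings itself.
with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['label', 'review'], delimiter='\t')
    writer.writeheader()
    writer.writerows(rows)

# Read the file back and keep only the rows labelled 'pos'.
with open(path, newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f, delimiter='\t')
    pos_rows = [row for row in reader if row['label'] == 'pos']

print(pos_rows)
```

For anything beyond a quick check, pandas remains more convenient because it handles filtering, counting and type inference in one place.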

Built-in magic command in Jupyter: %%writefile

%%writefile writes the contents of the cell to a file. Here the content will be written into text1.txt.

%%writefile text1.txt
Hello, this is the NLP lesson.
Please Like and Subscribe to show your support
Writing text1.txt

The -a flag appends the contents of the cell to an existing file. The file will be created if it does not exist.

%%writefile -a text1.txt
Thanks for watching
Appending to text1.txt

Using Python's built-in functions to read and write text files

The open() function opens a file and returns it as a file object. There are various modes in which you can open the file. Some of the basic modes are:

  • "r" - Read - Default value. Opens a file for reading, error if the file does not exist
  • "a" - Append - Opens a file for appending, creates the file if it does not exist
  • "w" - Write - Opens a file for writing, creates the file if it does not exist and erases its existing content if it does
  • "x" - Create - Creates the specified file, raises an error if the file already exists
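
The differences between these modes can be seen in a short sketch. It works on a throwaway file in a temporary directory; the file name modes_demo.txt is an arbitrary choice for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

with open(path, 'x') as f:      # 'x' creates the file; it must not exist yet
    f.write('first\n')

try:
    open(path, 'x')             # a second 'x' on the same path fails
    x_failed = False
except FileExistsError:
    x_failed = True

with open(path, 'a') as f:      # 'a' keeps existing content and adds to the end
    f.write('second\n')
with open(path) as f:           # the mode defaults to 'r'
    after_append = f.read()

with open(path, 'w') as f:      # 'w' discards the old content first
    f.write('third\n')
with open(path) as f:
    after_write = f.read()

print(x_failed, repr(after_append), repr(after_write))
```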

We have opened the file in the read mode.

file = open('text1.txt', 'r')
file
<_io.TextIOWrapper name='text1.txt' mode='r' encoding='cp1252'>

The read() method returns the specified number of characters from the file (or bytes, if the file is opened in binary mode). The default is -1, which means the whole file.

file.read()
'Hello, this is the NLP lesson.\nPlease Like and Subscribe to show your support\nThanks for watching\n'

If we read the same file again we will get an empty string. This is because the file pointer has reached the end of the file.

file.read()
''

seek() sets the file's current position at the offset specified. We have specified the offset as 0. Hence the file pointer will be set at the start of the file.

file.seek(0)
0

Now if we read the file we will not get an empty string.

file.read()
'Hello, this is the NLP lesson.\nPlease Like and Subscribe to show your support\nThanks for watching\n'
file.seek(0)
0

readline() reads one entire line from the file, including the trailing newline. If we call it a second time, it will read the second line.

file.readline()
'Hello, this is the NLP lesson.\n'
file.seek(0)
0

readlines() reads from the current position until EOF (end of file) and returns a list containing the lines.

file.readlines()
['Hello, this is the NLP lesson.\n',
 'Please Like and Subscribe to show your support\n',
 'Thanks for watching\n']
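
All of these read methods share one file position, so each call continues from where the previous one stopped. The sketch below shows this on a throwaway file; the name demo.txt and its three lines are made up for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as f:
    f.write('line one\nline two\nline three\n')

f = open(path)
first = f.readline()    # reads up to and including the first newline
rest = f.readlines()    # continues from the current position, not from the start
at_end = f.read()       # nothing is left, so this is an empty string
f.seek(0)               # move back to the start of the file
again = f.readline()    # the first line is available again
f.close()

print(repr(first), rest, repr(at_end), repr(again))
```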

It is a good practice to use the close() method to close a file after performing all the operations. After you close a file you cannot perform any operations on it but the file object is still available.

file.close()
file
<_io.TextIOWrapper name='text1.txt' mode='r' encoding='cp1252'>

If we do not want to close the file explicitly, we can open it in a with statement. The file is closed automatically when the with block ends.

with open('text1.txt') as file:
    text_data = file.readlines()
    print(text_data)
['Hello, this is the NLP lesson.\n', 'Please Like and Subscribe to show your support\n', 'Thanks for watching\n']
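
Whether a file is closed explicitly with close() or implicitly by leaving a with block, the file object stays around but can no longer be read. The following sketch checks both points on a throwaway file (closed_demo.txt is an arbitrary name).

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'closed_demo.txt')  # throwaway file

with open(path, 'w') as f:
    f.write('hello\n')

with open(path) as f:
    content = f.read()

closed_after_with = f.closed    # True: leaving the with block closed the file

try:
    f.read()                    # reading a closed file is not allowed
    error = None
except ValueError as exc:
    error = exc

print(closed_after_with, repr(content), error)
```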

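strip() returns a copy of the string with leading and trailing whitespace removed, including the newline at the end of each line. A different set of characters to remove can be passed as an argument.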

for temp in text_data:
    print(temp.strip())
Hello, this is the NLP lesson.
Please Like and Subscribe to show your support
Thanks for watching
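
strip() has two one-sided relatives, lstrip() and rstrip(), which are useful when only the trailing newline should go. A minimal sketch, using a made-up line with extra spaces so the difference is visible:

```python
line = '  Thanks for watching \n'

both = line.strip()              # whitespace removed from both ends
newline_only = line.rstrip('\n') # only the trailing newline removed
leading_only = line.lstrip()     # only leading whitespace removed
dashes = '--title--'.strip('-')  # a custom set of characters to remove

print(repr(both), repr(newline_only), repr(leading_only), repr(dashes))
```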

The enumerate() function adds a counter to an iterable and returns an enumerate object. This object can be used directly in a for loop or converted into a list of tuples using list().

for i, temp in enumerate(text_data):
    print(str(i) + "  --->  " + temp.strip())
0  --->  Hello, this is the NLP lesson.
1  --->  Please Like and Subscribe to show your support
2  --->  Thanks for watching
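
enumerate() also accepts a start argument, which is handy when line numbers should begin at 1. The sketch below combines it with an f-string; the width of 2 for the counter is an arbitrary choice.

```python
text_data = ['Hello, this is the NLP lesson.\n',
             'Please Like and Subscribe to show your support\n',
             'Thanks for watching\n']

# list() turns the enumerate object into (index, line) tuples.
pairs = list(enumerate(text_data))

# start=1 makes the counter begin at 1 instead of 0.
numbered = [f'{i:>2}  --->  {line.strip()}'
            for i, line in enumerate(text_data, start=1)]

for entry in numbered:
    print(entry)
```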

Now we will see how to write to a file. For that we will open a file in write ('w') mode.

file = open('text2.txt', 'w')
file
<_io.TextIOWrapper name='text2.txt' mode='w' encoding='cp1252'>

The write() method writes a specified text to the file. It returns the number of characters written.

file.write('This is just another lesson')
27

If you open text2.txt right now, it will most likely be empty. The text is still held in a buffer and is only guaranteed to reach the file once the file is flushed or closed.

file.close()
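
If the file has to stay open, flush() writes the pending text out without closing it. This is a minimal sketch on a throwaway file; the name flush_demo.txt is an arbitrary choice for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'flush_demo.txt')

f = open(path, 'w')
written = f.write('This is just another lesson')  # returns the character count
f.flush()                       # hand the buffered text to the operating system

with open(path) as reader:      # a second handle now sees the text
    seen_after_flush = reader.read()

f.close()                       # closing flushes too and releases the file

print(written, repr(seen_after_flush))
```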

An alternative way to write a file is given below. Because the file is opened in a with statement, it is closed automatically and no explicit close() is required.

with open('text3.txt', 'w') as file:
    file.write('This is third file \n')
text_data
['Hello, this is the NLP lesson.\n',
 'Please Like and Subscribe to show your support\n',
 'Thanks for watching\n']

Now we will open text3.txt in append mode and append the content of text_data to it.

with open('text3.txt', 'a') as file:
    for temp in text_data:
        file.write(temp)
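
The result of the append can be checked by reading the file back. The sketch below repeats the two steps in a temporary directory, so it does not touch the text3.txt created above, and uses writelines() as a shorter alternative to the loop.

```python
import os
import tempfile

text_data = ['Hello, this is the NLP lesson.\n',
             'Please Like and Subscribe to show your support\n',
             'Thanks for watching\n']

path = os.path.join(tempfile.mkdtemp(), 'text3.txt')

with open(path, 'w') as f:
    f.write('This is third file \n')

with open(path, 'a') as f:
    f.writelines(text_data)     # writes each string as-is; adds no newlines

with open(path) as f:
    lines = f.readlines()

for line in lines:
    print(line.strip())
```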