Working with Text Files in Python for NLP
Working with the text files
- Working with f-strings for formated print
- Working with .CSV, .TSV files to read and write
- Working with %%writefile to create simple .txt files [works in jupyter notebook only]
- Working with Python's inbuilt file read and write
Watch full video here:
String Formatter
String formatting enables us to display the strings in a specified format. This helps us to improve the visual effect and also to process the strings later.
name = 'KGP Talkie'
The format()
method formats the specified value(s) and insert them inside the string's placeholder. The placeholder is defined using curly brackets: {}.
print('The YouTube channel is {}'.format(name))
The YouTube channel is KGP Talkie
To create an f-string
, prefix the string with the letter “ f ”. The string itself can be formatted in much the same way that you would with str.format()
. F-strings provide a concise and convenient way to embed python expressions inside string literals for formatting.
print(f'The YouTube channel is {name}')
The YouTube channel is KGP Talkie
Now we are going to see how to work with minimum width and alignment between the columns. Here we have created a list of tuples.
data_science_tuts = [('Python for Beginners', 19), ('Feature Selectiong for Machine Learning', 11), ('Machine Learning Tutorials', 11), ('Deep Learning Tutorials', 19)] data_science_tuts
[('Python for Beginners', 19), ('Feature Selectiong for Machine Learning', 11), ('Machine Learning Tutorials', 11), ('Deep Learning Tutorials', 19)]
First we will print the contents of the list without any formating or alignment.
for info in data_science_tuts: print(info)
('Python for Beginners', 19) ('Feature Selectiong for Machine Learning', 11) ('Machine Learning Tutorials', 11) ('Deep Learning Tutorials', 19)
Now we will print the same thing using proper alignment. Here info[0]
represents the first value of the tuple and info[1]
represents the second value. {50}
and {20}
indicate the space between the columns.
for info in data_science_tuts: print(f'{info[0]:{50}} {info[1]:{10}}')
Python for Beginners 19 Feature Selectiong for Machine Learning 11 Machine Learning Tutorials 11 Deep Learning Tutorials 19
:<
Forces the field to be left-aligned within the available space (this is the default for most objects).:>
Forces the field to be right-aligned within the available space (this is the default for numbers).:^
Forces the field to be centered within the available space.
.
adds the dots which you can see below.
for info in data_science_tuts: print(f'{info[0]:<{50}} {info[1]:.>{10}}')
Python for Beginners ........19 Feature Selectiong for Machine Learning ........11 Machine Learning Tutorials ........11 Deep Learning Tutorials ........19
Working with .CSV or .TSV Files
Now we will see how to work with CSV
(Comma Separated Values) and TSV
(Tab Separated Values) files.
The first step is to read such files. We will use pandas
to read the files.
import pandas as pd
read_csv()
is an important pandas function to read CSV
files. We can use it to read TSV
files as well by setting sep = '\t'
which means the separator is a tab
. head()
returns the first 5 rows of the dataframe.
data = pd.read_csv('moviereviews.tsv', sep = '\t') data.head()
label | review | |
---|---|---|
0 | neg | how do films like mouse hunt get into theatres... |
1 | neg | some talented actresses are blessed with a dem... |
2 | pos | this has been an extraordinary year for austra... |
3 | pos | according to hollywood movies made in last few... |
4 | neg | my first press screening of 1998 and already i... |
The shape
attribute of pandas
dataFrame stores the number of rows and columns as a tuple (number of rows, number of columns)
. In the data which was read using read_csv()
there are 2000 rows and 2 columns.
data.shape
(2000, 2)
value_counts()
function return a Series containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element. We have called value_counts()
on data['label']
which is the column named label
. It has 1000 occurences of neg
and 1000 occurences of pos
.
data['label'].value_counts()
neg 1000 pos 1000 Name: label, dtype: int64
Now we are specifying a condition data['label']=='pos'
. That means we will only get those rows which have pos
in their label
column.
pos = data[data['label']=='pos'] pos.head()
label | review | |
---|---|---|
2 | pos | this has been an extraordinary year for austra... |
3 | pos | according to hollywood movies made in last few... |
11 | pos | with stars like sigourney weaver ( " alien " t... |
16 | pos | i remember hearing about this film when it fir... |
18 | pos | garry shandling makes his long overdue starrin... |
to_csv()
method is used to save a Pandas
DataFrame as a CSV
file. We have stored the dataframe pos
as a TSV file because we have set sep = '\t'
. We have set index = False
because we do not want the index to be stored in csv file.
pos.to_csv('pos.tsv', sep = '\t', index = False) pd.read_csv('pos.tsv', sep = '\t').head()
label | review | |
---|---|---|
0 | pos | this has been an extraordinary year for austra... |
1 | pos | according to hollywood movies made in last few... |
2 | pos | with stars like sigourney weaver ( " alien " t... |
3 | pos | i remember hearing about this film when it fir... |
4 | pos | garry shandling makes his long overdue starrin... |
Built in magic command in jupyter %%writefile
%%writefile
writes the contents of the cell to a file. Here the content will be written into text1.txt
.
%%writefile text1.txt Hello, this is the NLP lesson. Please Like and Subscribe to show your support
Writing text1.txt
-a
flag is used to append contents of the cell to an existing file. The file will be created if it does not exist.
%%writefile -a text1.txt Thanks for watching
Appending to text1.txt
Use python's inbuilt command to read and write text file
The open()
function opens a file, and returns it as a file object. There are various modes in which you can open the file. Some of the basic modes are:-
"r"
- Read - Default value. Opens a file for reading, error if the file does not exist"a"
- Append - Opens a file for appending, creates the file if it does not exist"w"
- Write - Opens a file for writing, creates the file if it does not exist"x"
- Create - Creates the specified file, returns an error if the file exist
We have opened the file in the read mode.
file = open('text1.txt', 'r') file
<_io.TextIOWrapper name='text1.txt' mode='r' encoding='cp1252'>
The read()
method returns the specified number of bytes from the file. Default is -1 which means the whole file.
file.read()
'Hello, this is the NLP lesson.\nPlease Like and Subscribe to show your support\nThanks for watching\n'
If we read the same file again we will get an empty string. This is because the file pointer has reached the end of the file.
file.read()
''
seek()
sets the file's current position at the offset specified. We have specified the offset as 0. Hence the file pointer will be set at the start of the file.
file.seek(0)
0
Now if we read the file we will not get an empty string.
file.read()
'Hello, this is the NLP lesson.\nPlease Like and Subscribe to show your support\nThanks for watching\n'
file.seek(0)
0
readline()
reads one entire line from the file. If we call it the second time it will read the second line.
file.readline()
'Hello, this is the NLP lesson.\n'
file.seek(0)
0
readlines()
reads until EOF(End Of File) using readline()
and returns a list containing the lines.
file.readlines()
['Hello, this is the NLP lesson.\n', 'Please Like and Subscribe to show your support\n', 'Thanks for watching\n']
It is a good practice to use the close()
method to close a file after performing all the operations. After you close a file you cannot perform any operations on it but the file object is still available.
file.close() file
<_io.TextIOWrapper name='text1.txt' mode='r' encoding='cp1252'>
If we do not want to explicitly close the file we can read the file in the following way.
with open('text1.txt') as file: text_data = file.readlines() print(text_data)
['Hello, this is the NLP lesson.\n', 'Please Like and Subscribe to show your support\n', 'Thanks for watching\n']
strip()
returns a copy of the string with both leading and trailing characters removed.
for temp in text_data: print(temp.strip())
Hello, this is the NLP lesson. Please Like and Subscribe to show your support Thanks for watching
enumerate()
method adds a counter to an iterable and returns it in a form of enumerate object. This enumerate object can then be used directly in for loops or be converted into a list of tuples using list() method.
for i, temp in enumerate(text_data): print(str(i) + " ---> " + temp.strip())
0 ---> Hello, this is the NLP lesson. 1 ---> Please Like and Subscribe to show your support 2 ---> Thanks for watching
Now we will see how to write a file. For that we will open a file in the write(w) mode.
file = open('text2.txt', 'w') file
<_io.TextIOWrapper name='text2.txt' mode='w' encoding='cp1252'>
The write()
method writes a specified text to the file. It returns the number of characters written.
file.write('This is just another lesson')
27
If you see text2.txt
right now it will be an empty file. This is because we need to close the file to complete the write operation.
file.close()
An alternative to write a file is given below. In this case closing of the file is not required.
with open('text3.txt', 'w') as file: file.write('This is third file \n')
text_data
['Hello, this is the NLP lesson.\n', 'Please Like and Subscribe to show your support\n', 'Thanks for watching\n']
Now we will open text3.txt
in append mode and append the content of text_data
to it.
with open('text3.txt', 'a') as file: for temp in text_data: file.write(temp)
1 Comment