Web Scraping, Dealing with FTP Servers and other things — All in One
Introduction
In this article, I will walk through the Python code for web scraping: how to deal with tables in a webpage and how to download files from an FTP server.
Web scraping is a technique for extracting data from a website and storing it in a logical format, either in a local file or in the cloud. It works by traversing the HTML code of the website and extracting data based on the website's various tags.
We will be scraping the Railroad Commission of Texas website to download some files from its FTP server. This is a fairly complex website, with different datasets present in multiple tables. Each table has 5 columns. We are interested in the following columns –
- Data Set Name & Description — Name of the dataset
- Download — Link to FTP server from where the files can be downloaded
- Manual — The description of the file present in Download column
Assumptions
This article assumes that –
- You know about the need and use cases of web scraping — why it is needed, etc.
- You have a basic knowledge of HTML and its various tags
Step 1 — Inspect webpage
The first step of web scraping is to inspect the web page to find the tag from which to extract the data. The data is usually buried deep in nested tags, so we inspect the web page to see under which tag the data we want to scrape is nested. To find the tag, right-click on the element and select Inspect Element (Q) from the menu.
In the Inspector box we find that the data we are looking for is in the body `<tbody>` of a table with `id="production-data-table"`. So, to extract the data we first have to locate the table and then its body tag.
Step 2 — Get the webpage
The second step is to get the web page. The `requests` package in Python is used to fetch the contents of the URL.
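A minimal sketch of this step is below. The URL shown is a placeholder for the RRC data sets page (the article never spells it out), and the request is wrapped in a try/except so the sketch stays runnable even when the network or server is unavailable:

```python
import requests

# Placeholder URL for the RRC data sets page -- substitute the real page URL.
URL = "https://www.rrc.texas.gov/resource-center/research/data-sets-available-for-download/"

try:
    page = requests.get(URL, timeout=30)
    ok = (page.status_code == 200)  # 200 means the page downloaded successfully
except requests.exceptions.RequestException:
    page, ok = None, False  # network unavailable; handled so the sketch keeps running
```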
Step 3 — Extract the data from webpage
The third step is to extract the data from the relevant section we have determined from step 1.
After getting the page in memory, we always check the status code to ensure that the page download was successful (`page.status_code == 200`). To traverse the webpage, we will use the `BeautifulSoup` Python package. The soup object enables us to easily navigate through the various tags of the web page.
This step of extracting the information is always custom-built for the website at hand, as the structure of the webpage, its tags, and its ids differ from one website to another. So the code below is customized for the Railroad Commission of Texas website.
As already found in step 1, the data we are going to extract is in Production Data table. The code for this is as follows.
Up to this point we have extracted all the desired data into a dataframe. Rename the columns for easier reference.
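The extraction and renaming described above could look like the sketch below. It uses a tiny stand-in HTML snippet instead of the live page (the real table has five columns; the three column names are my own choice):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Tiny stand-in for the real page source fetched in step 2.
html = """
<table id="production-data-table"><tbody>
  <tr>
    <td>Gas Ledger Dist 1</td>
    <td><a href="ftp://ftpe.rrc.texas.gov/shgled">Download</a></td>
    <td><a href="https://example.com/gas-manual.pdf">Manual</a></td>
  </tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the table found during inspection in step 1, then its body tag.
table = soup.find("table", id="production-data-table")
body = table.find("tbody")

# Keep the <td> tags themselves: the <a href> links inside are needed in later steps.
rows = [tr.find_all("td") for tr in body.find_all("tr")]

df = pd.DataFrame(rows)
df.columns = ["dataset_name", "data", "manual"]  # rename for easier reference
```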
The dataframe looks like the one below.
Step 4 — Clean the dataframe
Clean the dataframe so that meaningful information can be extracted easily. 4 new columns are created to store the URLs and formats.
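A sketch of this cleaning step is below. It assumes each scraped cell still holds the raw HTML of its `<td>`; the names `data_url` and `data_desc_url` are the ones the article uses later, while `data_desc_format` is illustrative:

```python
import pandas as pd

# Stand-in for the scraped dataframe: each cell holds the raw HTML of its <td>.
df = pd.DataFrame({
    "dataset_name": ["<td>Gas Ledger Dist 1</td>", "<td>Oil Ledger Dist 1</td>"],
    "data": ['<td><a href="ftp://ftpe.rrc.texas.gov/shgled">Download</a></td>'] * 2,
    "manual": ['<td><a href="https://example.com/manual.pdf">Manual</a></td>'] * 2,
})

# Pull the link targets out of the anchor tags; keep the manual's file
# extension (e.g. "pdf") as its format.
df["data_url"] = df["data"].str.extract(r'href="([^"]+)"', expand=False)
df["data_desc_url"] = df["manual"].str.extract(r'href="([^"]+)"', expand=False)
df["data_desc_format"] = df["data_desc_url"].str.rsplit(".", n=1).str[-1]
df["dataset_name"] = df["dataset_name"].str.extract(r"<td>(.*?)</td>", expand=False)
```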
The updated dataframe looks like the one below.
The information in the `data_url` and `data_desc_url` columns repeats, implying that the data for the first 3 rows is the same — i.e. the data for all the Gas Ledger Dist rows, all the Oil Ledger Dist rows, etc. is identical. All the data related to Gas Ledger, Oil Ledger, etc. should be downloaded into separate folders. Clean the dataframe further to retain only the unique rows.
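This de-duplication could be sketched as follows. The rows here are stand-ins (the second FTP URL is invented for illustration), and the district-suffix regex is my own assumption about how the repeated names differ:

```python
import pandas as pd

# Stand-in rows; the "oil-demo" FTP URL is hypothetical.
df = pd.DataFrame({
    "dataset_name": ["Gas Ledger Dist 1", "Gas Ledger Dist 2", "Oil Ledger Dist 1"],
    "data_url": [
        "ftp://ftpe.rrc.texas.gov/shgled",
        "ftp://ftpe.rrc.texas.gov/shgled",    # same directory as the row above
        "ftp://ftpe.rrc.texas.gov/oil-demo",  # hypothetical
    ],
})

# Rows pointing at the same directory are duplicates once the district suffix
# is dropped, so collapse the name and keep only unique (name, url) pairs.
df["dataset_name"] = df["dataset_name"].str.replace(r"\s+Dist\s+\d+$", "", regex=True)
df = df.drop_duplicates(subset=["dataset_name", "data_url"]).reset_index(drop=True)
```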
The above code results in a dataframe with only the unique rows.
Step 5 — Extract file list from FTP server
The URL in `data_url` (ftp://ftpe.rrc.texas.gov/shgled) is the link to the directory where all the files are kept. To extract the list of files we have to separate the FTP base URL and the folder name.
Step 5A — Create columns with FTP base URL and folder
Using a regex, create new columns for the FTP base URL and the folder name.
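One way to do this split is sketched below; the new column names `ftp_base_url` and `ftp_folder` are my own choice:

```python
import pandas as pd

df = pd.DataFrame({"data_url": ["ftp://ftpe.rrc.texas.gov/shgled"]})

# Split ftp://<host>/<folder> into the host (needed to connect with ftplib)
# and the directory name on the server.
df["ftp_base_url"] = df["data_url"].str.extract(r"ftp://([^/]+)/", expand=False)
df["ftp_folder"] = df["data_url"].str.extract(r"ftp://[^/]+/(.+)", expand=False)
```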
Step 5B — Extract list of files
The Python package `ftplib` is used to extract the information from the FTP server. First we have to log in to the FTP server, change the working directory, and then extract the list of files.
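These three steps (log in, change directory, list files) could be wrapped in a helper like the one below. The helper name is my own, anonymous login is assumed, and errors are swallowed so the sketch also runs when the server is unreachable:

```python
from ftplib import FTP, all_errors

def list_ftp_files(host, folder):
    """Log in anonymously, change into `folder`, and return the file names there."""
    try:
        with FTP(host, timeout=30) as ftp:
            ftp.login()        # anonymous login
            ftp.cwd(folder)    # change the working directory
            return ftp.nlst()  # names of the files in that directory
    except all_errors:
        return []              # server unreachable: return an empty list

# e.g. df["file_url_list"] = [list_ftp_files(h, f)
#                             for h, f in zip(df["ftp_base_url"], df["ftp_folder"])]
```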
The dataframe after executing the above code is shown below.
Step 6 — Download the files from FTP Server
First create a folder based on the `dataset_name` column. After creating the folder, download the files based on the list of files in the `file_url_list` column. Note that downloading files from an FTP server works differently from downloading over HTTP: instead of a plain GET request, the client retrieves each file with an FTP `RETR` command. The code to download the files is as below.
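A sketch of the download step is below. The helper name and its arguments are my own; as in the previous step, anonymous login is assumed and FTP errors are kept non-fatal so the sketch stays runnable offline:

```python
import os
from ftplib import FTP, all_errors

def download_ftp_folder(host, folder, dataset_name, file_names):
    """Create a folder named after the dataset and download each file into it."""
    os.makedirs(dataset_name, exist_ok=True)   # one local folder per dataset
    try:
        with FTP(host, timeout=30) as ftp:
            ftp.login()
            ftp.cwd(folder)
            for name in file_names:
                with open(os.path.join(dataset_name, name), "wb") as fh:
                    # FTP pulls files with a RETR command rather than an HTTP GET
                    ftp.retrbinary(f"RETR {name}", fh.write)
    except all_errors:
        pass  # server unreachable: the folder is still created, no files are saved
```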
After the execution of the above code, the new folders and files will be created.
Conclusion
In this article I have described, in detail, the steps required for web scraping: reading the data from a webpage table and downloading files from an FTP server.