Web Scraping, Dealing with FTP Servers and other things — All in One

Divij Sharma
Published in Analytics Vidhya
5 min read · Apr 8, 2020


Introduction

In this article, I will walk through Python code for web scraping, handling tables in a webpage, and downloading files from an FTP server.

Web scraping is a technique to extract data from a website and store it in a logical format, either in a local file or in the cloud. It works by traversing the HTML code of a website and extracting data from it based on the website's various tags.

We will be scraping the Railroad Commission of Texas website to download some files from their FTP server. This is a fairly complex website with different datasets present in multiple tables. Each table has 5 columns. We are interested in the following columns:

  1. Data Set Name & Description — Name of the dataset
  2. Download — Link to FTP server from where the files can be downloaded
  3. Manual — The description of the files present in the Download column

Assumptions

This article assumes that:

  1. You know why web scraping is needed and what its common use cases are
  2. You have a basic knowledge of HTML and its various tags

Step 1 — Inspect webpage

The first step of web scraping is to inspect the web page to find the tag from which to extract the data. The data is usually buried deep in nested tags, so we inspect the web page to see under which tag the data we want to scrape is nested. To find the tag, right-click on the element and select Inspect Element (Q) from the menu.

Right Click Menu

In the Inspector box we find that the data we are looking for is in the body (<tbody>) of a table with id = "production-data-table". So, to extract the data, we first have to locate the table and then its body tag.

Inspector Box

Step 2 — Get the webpage

The second step is to get the web page. The requests package in Python is used to fetch the contents of a URL.

Request the page
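
A minimal sketch of this step is below. The page URL is a placeholder, not necessarily the exact address used in the article.

```python
import requests

# Placeholder URL for the Railroad Commission of Texas datasets page;
# substitute the actual page you are scraping
url = "https://www.rrc.texas.gov/resource-center/research/data-sets-available-for-download/"

# Fetch the page; the response object holds the HTML and the status code
page = requests.get(url)
print(page.status_code)  # 200 means the page was downloaded successfully
```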

Step 3 — Extract the data from webpage

The third step is to extract the data from the relevant section identified in step 1.

After getting the page in memory, we always check the status code to ensure that the page was downloaded successfully (page.status_code == 200). To traverse the webpage, we will use the BeautifulSoup Python package. The soup object lets us easily navigate the various tags of the web page.

This step of extracting the information from the website is always custom-built for the website at hand, as the structure of a webpage, its tags and its ids differ from one website to another. So the code below is customized for the Railroad Commission of Texas website.

As found in step 1, the data we are going to extract is in the Production Data table. The code for this is as follows.

Extracting data from page
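
A sketch of this step, assuming the 5-column layout described above. The cell positions and the choice to keep the raw HTML of the link cells (so the hrefs survive for the later regex processing) are my assumptions.

```python
from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(page.content, "html.parser")

# Locate the Production Data table by its id, then its body
table = soup.find("table", id="production-data-table")
tbody = table.find("tbody")

records = []
for tr in tbody.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) == 5:  # each table on the page has 5 columns
        records.append({
            "Data Set Name & Description": cells[0].get_text(strip=True),
            # keep the raw HTML so the link hrefs survive (cell positions assumed)
            "Download": str(cells[1]),
            "Manual": str(cells[2]),
        })

df = pd.DataFrame(records)
```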

At this point we have extracted all the desired data into a dataframe. Rename the columns for easier reference.

Rename the columns for easier processing
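
Something along these lines; dataset_name is the name used later in the article, while the other two short names are assumed intermediates.

```python
# Short names for easier reference; "dataset_name" is used later in the
# article, the other two names are assumptions
df = df.rename(columns={
    "Data Set Name & Description": "dataset_name",
    "Download": "data",
    "Manual": "data_desc",
})
```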

The resulting dataframe is shown below.

DataFrame with webpage table data

Step 4 — Clean the dataframe

Clean the dataframe so that meaningful information can be extracted easily. Four new columns are created to store the URLs and formats.

Clean dataframe for processing
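
A hypothetical version of this cleaning step: the URL and the visible link text (read here as the file format) are pulled out of the raw link HTML, giving four new columns. The data_url and data_desc_url names come from the article; the regexes and the two *_format names are my assumptions.

```python
# Extract the href and the link text from the raw HTML kept earlier.
# data_url / data_desc_url are the article's names; the *_format columns
# and the regexes are assumptions.
df["data_url"] = df["data"].str.extract(r'href="([^"]+)"', expand=False)
df["data_format"] = df["data"].str.extract(r">([^<]+)<", expand=False)
df["data_desc_url"] = df["data_desc"].str.extract(r'href="([^"]+)"', expand=False)
df["data_desc_format"] = df["data_desc"].str.extract(r">([^<]+)<", expand=False)
```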

The updated dataframe is shown below.

The information in the data_url and data_desc_url columns repeats, implying that the data for the first 3 rows is the same, i.e. the data for all Gas Ledger districts is the same, as is the data for all Oil Ledger districts, and so on. The data related to the Gas Ledger, Oil Ledger, etc. should be downloaded into separate folders. We clean the dataframe further to retain only the unique rows.

Retain only unique rows after regex processing
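
One way to do this; the exact regex used to normalise the dataset names is an assumption.

```python
# Strip the district qualifier (e.g. "Gas Ledger Dist 01" -> "Gas Ledger")
# so rows pointing at the same FTP directory collapse into one
df["dataset_name"] = (
    df["dataset_name"]
    .str.replace(r"\s*Dist(rict)?\s*\d*$", "", regex=True)
    .str.strip()
)

# Keep only one row per distinct download location
df = df.drop_duplicates(subset=["data_url"]).reset_index(drop=True)
```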

The above code results in

Unique data

Step 5 — Extract file list from FTP server

The URL in data_url (ftp://ftpe.rrc.texas.gov/shgled) is the link to the directory where all the files are kept. To extract the list of files, we have to separate the FTP base URL and the folder name.

Step 5A — Create columns with FTP base URL and folder

Using regex, create new columns for the FTP base URL and the folder name.

Create new column with FTP base URL and folder name
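
A sketch of the split, using the example URL above; the two new column names are assumptions.

```python
# Split e.g. ftp://ftpe.rrc.texas.gov/shgled into host and folder
df[["ftp_base_url", "ftp_folder"]] = df["data_url"].str.extract(
    r"ftp://([^/]+)/(.+)"
)
```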
DataFrame with FTP base URL and folder name

Step 5B — Extract list of files

The Python standard library module ftplib is used to extract information from the FTP server. First we log in to the FTP server, then change the working directory, and finally extract the list of files.

Extract list of files
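
A sketch of this step with ftplib. The helper function and the ftp_* column names carried over from the previous step are assumptions; file_url_list is the column name used later in the article.

```python
from ftplib import FTP

def list_ftp_files(host, folder):
    """Log in anonymously, change into the folder and list its files."""
    ftp = FTP(host)      # connect to e.g. ftpe.rrc.texas.gov
    ftp.login()          # anonymous login
    ftp.cwd(folder)      # change the working directory
    files = ftp.nlst()   # names of the files in the directory
    ftp.quit()
    return files

df["file_url_list"] = df.apply(
    lambda row: list_ftp_files(row["ftp_base_url"], row["ftp_folder"]), axis=1
)
```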

The dataframe after executing the above code is:

DataFrame with new column having list of files in FTP Server

Step 6 — Download the files from FTP Server

First, create a folder based on the dataset_name column. After creating the folder, download the files listed in the file_url_list column. Downloading files from an FTP server works differently from downloading them from an HTTP server. The code to download the files is below.

Download files from FTP server
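
A minimal sketch, reusing the columns built above.

```python
import os
from ftplib import FTP

for _, row in df.iterrows():
    # One local folder per dataset, named after the dataset_name column
    os.makedirs(row["dataset_name"], exist_ok=True)

    ftp = FTP(row["ftp_base_url"])
    ftp.login()
    ftp.cwd(row["ftp_folder"])
    for filename in row["file_url_list"]:
        local_path = os.path.join(row["dataset_name"], filename)
        with open(local_path, "wb") as fh:
            # RETR streams the remote file in binary mode
            ftp.retrbinary("RETR " + filename, fh.write)
    ftp.quit()
```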

After executing the above code, the new folders and files will be created.

Created Folders

Conclusion

In this article I have described, in detail, the steps required for web scraping, reading data from a webpage table, and downloading files from an FTP server.
