Data Extraction: AI End-to-End Series (Part — 1)

INSAID
9 min read · Nov 23, 2021

By Hiren Rupchandani, Abhinav Jangir, and Ashish Lepcha

Data Extraction

  • Data extraction is the process of collecting and retrieving data from a variety of sources for further data processing, analysis, and storing elsewhere.
  • It helps recognize which information is most valuable for accomplishing your business objectives, driving the overall ETL process.
  • Sources are typically crawled and the collected data is then analyzed to extract the relevant information.

Various Data Extraction Techniques

1. Web Data Extraction:

  • Web data extraction, also known as web scraping, is a technique for extracting vast amounts of data from websites on the internet.
  • The data available on websites usually cannot be downloaded directly but can be accessed using a web browser.
  • Web data is of great use to e-commerce companies, the entertainment industry, research firms, data scientists, governments, and social media companies, and can even help the healthcare industry with ongoing research and predicting the spread of diseases. A minimal scraping sketch in Python follows this list.
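
As a minimal sketch of web scraping in Python, the snippet below pairs the requests and beautifulsoup4 libraries (both assumed to be installed); the URL and the <h2> tag are hypothetical placeholders:

import requests
from bs4 import BeautifulSoup

# Hypothetical page; substitute the site you actually want to scrape.
url = "https://example.com/products"

# Download the raw HTML of the page.
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML and pull out every product title (assumed to live in <h2> tags).
soup = BeautifulSoup(response.text, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(titles)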

2. API Based Extraction:

  • An API (Application Programming Interface) is a standardized and secure interface that allows applications to communicate and work with each other.
  • It provides a consistent and standard platform for communication between different systems, so you do not have to create an integration layer yourself.
  • It allows you to automate the retrieval process instead of manually fetching the data each time (a minimal sketch follows this list).
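
For illustration, here is a minimal sketch of API-based extraction using Python's requests library; the endpoint, token, and parameters are hypothetical placeholders:

import requests

# Hypothetical REST endpoint and API key; replace with a real service's values.
url = "https://api.example.com/v1/records"
headers = {"Authorization": "Bearer YOUR_API_TOKEN"}
params = {"limit": 100, "page": 1}  # most JSON APIs support filtering/paging

response = requests.get(url, headers=headers, params=params, timeout=10)
response.raise_for_status()

# The JSON response body is parsed straight into Python dicts/lists.
data = response.json()
print(len(data), "records fetched")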

3. Data Retrieval from Database:

  • Database extraction is the process of retrieving data from one or more, possibly disparate, databases.
  • To retrieve the desired data, the user specifies a set of criteria in a query.
  • The DBMS then returns the requested data from the database.
  • The retrieved data may be stored in a file, printed, or viewed on the screen. A runnable sketch using Python's built-in sqlite3 module is shown after this list.
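
As a self-contained sketch of query-based retrieval, the snippet below uses Python's built-in sqlite3 module; the table, columns, and rows are made up for illustration:

import sqlite3

# Create an in-memory database with a toy table so the example is runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Asha", "Data", 90000.0), ("Ravi", "HR", 60000.0), ("Meera", "Data", 85000.0)],
)

# The user specifies criteria in a query; the DBMS returns the matching rows.
cursor = conn.execute(
    "SELECT name, salary FROM employees WHERE department = ? ORDER BY salary DESC",
    ("Data",),
)
for name, salary in cursor:
    print(name, salary)
conn.close()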

In this article, we will look at ways to push and pull data with MongoDB and GCP using Python.

MongoDB

  • MongoDB is a document-oriented NoSQL database used for high-volume data storage and stores data in JSON-like documents.
  • Instead of using tables and rows as in the traditional relational databases, MongoDB makes use of collections and documents.
  • Documents consist of key-value pairs which are the basic unit of data in MongoDB.
  • Documents have a dynamic schema. Dynamic schema means that documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection’s documents may hold different types of data.
  • Support for modern use cases like:
    - Geo-based search
    - Graph search
    - Text search
  • Queries are themselves JSON, and thus easily composable. A short pymongo sketch after this list illustrates the dynamic schema and composable queries.
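
To make the dynamic-schema and JSON-query points concrete, here is a minimal pymongo sketch; the connection string, database, collection, and fields are invented for illustration and assume a locally running MongoDB instance:

from pymongo import MongoClient

# Hypothetical local instance; replace with your own connection string.
client = MongoClient("mongodb://localhost:27017/")
people = client["demo_db"]["people"]

# Two documents in the same collection with different fields: a dynamic schema.
people.insert_one({"name": "Asha", "age": 29, "skills": ["python", "sql"]})
people.insert_one({"name": "Ravi", "city": "Delhi"})

# Queries are plain JSON-like dicts, so filters compose naturally.
query = {"age": {"$gte": 25}, "skills": "python"}
for doc in people.find(query):
    print(doc)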

We will be pulling the image data from MongoDB into our notebook and will display the Mask and Non-Mask labeled images.

→ Installing PyMongo

!pip install pymongo[srv]

→ Importing Libraries

import datetime
import glob
import json
import os
import random
import urllib
import zipfile

import cv2
import gridfs
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import PIL
from bson import ObjectId
from pymongo import MongoClient

→ Connecting to the database:

> client = MongoClient("mongodb+srv://sqweeks:sqweeks@insaid1.lhzwa.mongodb.net/myFirstDatabase?retryWrites=true&w=majority")
> db = client.test

→ Show details about the client:

> client.stats
Database(MongoClient(host=['insaid1-shard-00-00.lhzwa.mongodb.net:27017', 'insaid1-shard-00-01.lhzwa.mongodb.net:27017', 'insaid1-shard-00-02.lhzwa.mongodb.net:27017'], document_class=dict, tz_aware=False, connect=True, retrywrites=True, w='majority', authsource='admin', replicaset='atlas-mz35m9-shard-0', ssl=True), 'stats')

→ Check the available databases

> client.list_database_names()
['sqweeks', 'test', 'admin', 'local']

→ Create a new collection (Optional):

We can create the collection or leave it to MongoDB to create it as soon as a document is generated.

> db.create_collection('sqweeks0p')
  • To view the collections, we call db.list_collections().
  • The call returns a cursor; an empty list '[]' after conversion means that there are no collections in the database.
  • We need to convert the cursor to a list to see the contents:
> list(db.list_collections())
[{'idIndex': {'key': {'_id': 1}, 'name': '_id_', 'v': 2},
  'info': {'readOnly': False, 'uuid': UUID('2df4d7af-521e-4f86-ae92-54fa243a24bd')},
  'name': 'fs.chunks',
  'options': {},
  'type': 'collection'},
 {'idIndex': {'key': {'_id': 1}, 'name': '_id_', 'v': 2},
  'info': {'readOnly': False, 'uuid': UUID('46a46d8b-5b3b-43b7-a2c7-a731c9340616')},
  'name': 'fs.files',
  'options': {},
  'type': 'collection'},
 {'idIndex': {'key': {'_id': 1}, 'name': '_id_', 'v': 2},
  'info': {'readOnly': False, 'uuid': UUID('7b86c1af-10b7-4909-891e-599670d18aa8')},
  'name': 'sqweeks0p',
  'options': {},
  'type': 'collection'},
 {'idIndex': {'key': {'_id': 1}, 'name': '_id_', 'v': 2},
  'info': {'readOnly': False, 'uuid': UUID('8d1498c5-f351-4021-b5f7-ffef1ba11d48')},
  'name': 'sqweeks',
  'options': {},
  'type': 'collection'}]

→ Selecting a collection:

> testCollection = db['sqweeks0p']

→ Count the number of documents present in the collection:

> testCollection.count_documents({})
59

→ Find the correct document:

> image = testCollection.find_one({'name': 'Train Set'})['images'][0]
> image
{'dtype': 'uint8',
 'imageID': ObjectId('619148f6dff0e52ef7b01d59'),
 'imgName': '0066.jpg',
 'label': 'Mask',
 'shape': [409, 615, 3]}

→ Fetching and displaying the image data in our notebook

> fs = gridfs.GridFS(db)                             # GridFS stores large binaries across fs.files/fs.chunks
> imOut = fs.get(image['imageID'])                   # fetch the stored image bytes by ObjectId
> img = np.frombuffer(imOut.read(), dtype=np.uint8)  # raw bytes -> flat uint8 array
> img = np.reshape(img, image['shape'])              # restore the original (height, width, channels) shape
> plt.imshow(img)
Displaying an image retrieved from MongoDB

Google Cloud Platform

  • Google Cloud Platform (GCP) is a suite of cloud computing services developed by Google.
  • The platform includes a wide range of services covering different areas of cloud computing, such as storage and application development.
  • GCP offers Computing, Networking, Storage, Big Data, Machine Learning, and Management services.
  • It runs on the same cloud infrastructure that Google uses internally for its end-user products, such as Google Search, Gmail, Google Photos, and YouTube.

Google Storage Services

Storage solutions by Google Storage Services

Google Cloud Storage

  • Google Cloud Storage is a RESTful online file storage web service for storing and accessing data on Google Cloud Platform infrastructure.
  • The service combines the performance and scalability of Google’s cloud with advanced security and sharing capabilities.
  • Cloud Storage is a flexible, scalable, and durable storage option for your virtual machine instances.
  • You can read and write files to Cloud Storage buckets from almost anywhere, so you can use buckets as common storage between your instances, app engines, your on-premises systems, and other cloud services.
  • GCS buckets are designed to store our data regardless of format or size.
  • Generally, we can upload and download files using the following methods:
    - Using Cloud Storage GUI
    - Command-line interface (CLI) using the gsutil tool
    - Google Cloud Storage’s REST API

How to store data in Google Cloud Storage Bucket?

Create a bucket

  • Go to the Google Cloud Platform dashboard and create a new project.
  • This is how the dashboard looks after creating the project My First Project.
  • Go to the Storage option under the Resources column. This will lead you to the next step which is creating a bucket for Cloud Storage.
  • Choose a globally unique name for your bucket.
  • The name of your bucket cannot be changed later; buckets can't be renamed after they are created.
  • The next step is choosing the Storage Location. GCP provides multiple options for the geographic placement of our data. Here we'll choose regional, with the location asia-south2 (Delhi).
  • GCS offers four storage classes to choose from: Standard, Nearline, Coldline, and Archive. Here we'll go with the default Standard storage class.
  • The next step is to choose the access control for objects, i.e., who and what has access to the stored objects.
  • Fine-grained control uses GCP IAM policies to determine who has access to a specific resource. With uniform access control, we won't be able to set access per object later on.
  • Now choose the security option and create a bucket.
  • This is how the storage bucket dashboard looks.
  • Now we can also upload files and folders from the dashboard; a scripted alternative using the Python client library is sketched after this list.
  • We will be billed for the new storage bucket based on usage: the more we use it, the more we pay each month.
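
Instead of the dashboard, uploads can also be scripted. The sketch below uses the official google-cloud-storage Python client (pip install google-cloud-storage); the bucket and object names are placeholders, and it assumes your credentials are already configured (e.g., via GOOGLE_APPLICATION_CREDENTIALS):

from google.cloud import storage

# Assumes application-default or service-account credentials are set up.
client = storage.Client()

# Hypothetical bucket and object names; replace with your own.
bucket = client.bucket("my-unique-bucket-name")
blob = bucket.blob("datasets/Face_Mask_Data.zip")

# Upload a local file into the bucket.
blob.upload_from_filename("Face_Mask_Data.zip")
print("Uploaded", blob.name, "to", bucket.name)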

How will the stored data look?

How to download Data from GCS?

1. Using Console

  • In the Google Cloud Console, go to the Cloud Storage Browser page.
  • Go to Browser. Select the bucket from the bucket list that contains the object file we want to download.
  • Navigate to the object and download the object using the Download icon directly to your local system.

2. Using the gsutil command

  • Use the following command to download the object using gsutil tool:
    > gsutil cp gs://BUCKET_NAME/OBJECT_NAME SAVE_TO_LOCATION
  • Here,
    - BUCKET_NAME is the name of the bucket containing the object.
    - OBJECT_NAME is the name of the object we are downloading.
    - SAVE_TO_LOCATION is the local path where we are saving the object. A concrete example with hypothetical names follows this list.
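
For example, with hypothetical bucket and object names filled in, the command might look like this:

> gsutil cp gs://my-unique-bucket-name/datasets/Face_Mask_Data.zip /content/downloads/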

3. Using APIs

  • Get an authorization access token from the OAuth 2.0 Playground.
  • Configure the playground to use your own OAuth credentials. Here we'll generate a token for Cloud Storage API v1 to access GCS.
OAuth 2.0 Playground
  • Use cURL to call the JSON API with a GET Object request:
> curl -X GET \
-H "Authorization: Bearer OAUTH2_TOKEN" \
-o "SAVE_TO_LOCATION" \
"https://storage.googleapis.com/storage/v1/b/BUCKET_NAME/o/OBJECT_NAME?alt=media"
  • Here,
    - OAUTH2_TOKEN is the access token we generated in Step 1.
    - SAVE_TO_LOCATION is the path to the location where we want to save our object.
    - BUCKET_NAME is the name of the bucket containing the object we are downloading.
    - OBJECT_NAME is the name of the object we are downloading.
  • The same download can also be scripted; see the sketch below.
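
Alternatively, the same object can be fetched in a few lines with the google-cloud-storage Python client; as with the upload sketch earlier, the names are placeholders and configured credentials are assumed:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-unique-bucket-name")    # hypothetical bucket
blob = bucket.blob("datasets/Face_Mask_Data.zip")  # hypothetical object

# Download the object to a local path.
blob.download_to_filename("Face_Mask_Data.zip")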

Importing Images Dataset from GCS

1. Using the gsutil command

The following command uses gsutil to retrieve data from GCS:

> !gsutil cp -r gs://insaid_e2e/Face_Mask_Data.zip /content/mask_data
'''
Output:
Copying gs://insaid_e2e/Face_Mask_Data.zip... [1 files][ 7.9 MiB/ 7.9 MiB]
Operation completed over 1 objects/7.9 MiB.
'''

2. Using Get API Request

  • Get the data from the GCS using wget:
> !wget https://storage.googleapis.com/insaid_e2e/Face_Mask_Data.zip
'''
Output:
https://storage.googleapis.com/insaid_e2e/Face_Mask_Data.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.196.128, 173.194.197.128, 64.233.191.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.196.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8300590 (7.9M) [application/x-zip-compressed]
Saving to: 'Face_Mask_Data.zip'
Face_Mask_Data.zip 100%[===================>] 7.92M 3.64MB/s in 2.2s
2021-11-21 14:39:06 (3.64 MB/s) - 'Face_Mask_Data.zip' saved [8300590/8300590]
'''
  • Unzip the downloaded dataset:
> !unzip /content/Face_Mask_Data.zip
'''
Output:
Archive: /content/Face_Mask_Data.zip
inflating: Data/Non Mask/112.jpg
inflating: Data/Non Mask/123.jpg
inflating: Data/Mask/0121.png
...
...
...
inflating: Data/Mask/0097.png
inflating: Data/Mask/0116.png
'''
  • Display any image for reference:
> image_mp = mpimg.imread(r'/content/Data/Mask/0022.jpg')
> imgplot = plt.imshow(image_mp)
> plt.show()
The output of Sample Image (Mask/0022.jpg)
  • And voila! We have successfully extracted the images and displayed them in our notebook.

What’s Next?

In the next article of this series, we will see how to preprocess these images.

Follow us for more upcoming articles related to Data Science, Machine Learning, and Artificial Intelligence.

Also, do give us a Clap👏 if you find this article useful, as your encouragement catalyzes inspiration and helps us create more cool stuff like this.

Visit us on https://www.insaid.co/

