Data Extraction: AI End-to-End Series (Part 1)

By Hiren Rupchandani, Abhinav Jangir, and Ashish Lepcha

Data Extraction

  • Data extraction is the process of collecting and retrieving data from a variety of sources for further processing, analysis, or storage elsewhere.

Various Data Extraction Techniques

1. Web Data Extraction:

  • Web data extraction, also known as web scraping, is a technique for extracting vast amounts of data from websites on the internet.

2. API Based Extraction:

  • An API (Application Programming Interface) is a standardized and secure interface that allows applications to communicate and work with each other.

3. Data Retrieval from Database:

  • Database extraction is the process of retrieving data from disparate databases.
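As a minimal sketch of the third technique, the snippet below retrieves rows from an in-memory SQLite database using Python's built-in sqlite3 module. The `images` table and its contents are hypothetical, chosen to mirror the mask-image dataset used later in this article:

```python
import sqlite3

# Illustrative only: an in-memory SQLite database standing in for a real data source
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A hypothetical "images" table similar in spirit to the dataset used below
cur.execute("CREATE TABLE images (img_name TEXT, label TEXT)")
cur.executemany(
    "INSERT INTO images VALUES (?, ?)",
    [("0066.jpg", "Mask"), ("112.jpg", "Non Mask")],
)
conn.commit()

# The extraction step: pull only the rows we need with a filtered query
rows = cur.execute("SELECT img_name FROM images WHERE label = 'Mask'").fetchall()
print(rows)  # [('0066.jpg',)]
conn.close()
```

The same pattern (connect, query, fetch) applies to any relational source; only the driver and connection string change.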

In this article, we will look at ways to push data to and pull data from MongoDB and GCP using Python.


  • MongoDB is a document-oriented NoSQL database used for high-volume data storage; it stores data in JSON-like documents.

We will pull the image data from MongoDB into our notebook and display the Mask and Non-Mask labeled images.
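To make "JSON-like documents" concrete, here is a hypothetical document in the same shape as the image metadata fetched later in this article. Because the structure maps directly to Python dicts, it serializes cleanly to and from JSON:

```python
import json

# A hypothetical document mirroring the image metadata used below;
# MongoDB stores such documents as BSON, which maps naturally to Python dicts.
doc = {
    "name": "Train Set",
    "images": [
        {"imgName": "0066.jpg", "label": "Mask", "dtype": "uint8", "shape": [409, 615, 3]},
    ],
}

# Round-trip through JSON to show the document is plain, nested key-value data
serialized = json.dumps(doc, indent=2)
print(serialized)
```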

→ Installing PyMongo

!pip install pymongo[srv]

→ Importing Libraries

from pymongo import MongoClient
from bson import ObjectId
import gridfs
import urllib
import json
import datetime
import random
import os
import glob
import zipfile
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import cv2
import PIL

→ Connecting to the database:

> client = MongoClient("mongodb+srv://")
> db = client.test

→ Show details about the client:

> client.stats
Database(MongoClient(host=['insaid1-shard-00', 'insaid1-shard-00-', 'insaid1-shard-00-'], document_class=dict, tz_aware=False, connect=True, retrywrites=True, w='majority', authsource='admin', replicaset='atlas-mz35m9-shard-0', ssl=True), 'stats')

→ Check the available databases

> client.list_database_names()
['sqweeks', 'test', 'admin', 'local']

→ Create a new collection (Optional):

We can create the collection or leave it to MongoDB to create it as soon as a document is generated.

> db.create_collection('sqweeks0p')
  • To view the collections, we use db.list_collections():
> list(db.list_collections())
[{'idIndex': {'key': {'_id': 1}, 'name': '_id_', 'v': 2},
  'info': {'readOnly': False, 'uuid': UUID('2df4d7af-521e-4f86-ae92-54fa243a24bd')},
  'name': 'fs.chunks',
  'options': {},
  'type': 'collection'},
 {'idIndex': {'key': {'_id': 1}, 'name': '_id_', 'v': 2},
  'info': {'readOnly': False, 'uuid': UUID('46a46d8b-5b3b-43b7-a2c7-a731c9340616')},
  'name': 'fs.files',
  'options': {},
  'type': 'collection'},
 {'idIndex': {'key': {'_id': 1}, 'name': '_id_', 'v': 2},
  'info': {'readOnly': False, 'uuid': UUID('7b86c1af-10b7-4909-891e-599670d18aa8')},
  'name': 'sqweeks0p',
  'options': {},
  'type': 'collection'},
 {'idIndex': {'key': {'_id': 1}, 'name': '_id_', 'v': 2},
  'info': {'readOnly': False, 'uuid': UUID('8d1498c5-f351-4021-b5f7-ffef1ba11d48')},
  'name': 'sqweeks',
  'options': {},
  'type': 'collection'}]

→ Connecting to a Collection:

> testCollection = db['sqweeks0p']

→ Count the number of documents present in the collection:

> testCollection.count_documents({})

→ Find the correct document:

> image = testCollection.find_one({'name': 'Train Set'})['images'][0]
> image
{'dtype': 'uint8',
 'imageID': ObjectId('619148f6dff0e52ef7b01d59'),
 'imgName': '0066.jpg',
 'label': 'Mask',
 'shape': [409, 615, 3]}

→ Fetching and calling the data in our notebook

> fs = gridfs.GridFS(db)
> imOut = fs.get(image['imageID'])
> img = np.frombuffer(imOut.read(), dtype=np.uint8)
> img = np.reshape(img, image['shape'])
> plt.imshow(img)
Displaying an image retrieved from MongoDB
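GridFS stores only raw bytes, which is why the fetch step reconstructs the image from the buffer plus the saved dtype and shape. The round trip can be sketched in isolation with a small dummy array (no MongoDB required):

```python
import numpy as np

# A small dummy "image" (height x width x channels) standing in for real pixel data
original = np.arange(24, dtype=np.uint8).reshape(2, 4, 3)

# Storing: GridFS keeps raw bytes, so the array is serialized first
raw_bytes = original.tobytes()

# Retrieving: rebuild the array from the bytes plus the saved dtype and shape,
# the same frombuffer/reshape pair used with image['shape'] above
restored = np.frombuffer(raw_bytes, dtype=np.uint8).reshape(2, 4, 3)
assert np.array_equal(original, restored)
```

Storing dtype and shape alongside the bytes (as the `image` document does) is what makes the reconstruction possible; the bytes alone are ambiguous.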

Google Cloud Platform

  • Google Cloud Platform is a suite of cloud computing services through which users can access cloud systems and other computing services developed by Google.
  • GCP is a set of Computing, Networking, Storage, Big Data, Machine Learning, and Management services provided by Google.

Google Storage Services

Storage solutions by Google Storage Services

Google Cloud Storage

  • Google Cloud Storage is a RESTful online file storage web service for storing and accessing data on Google Cloud Platform infrastructure.
  • You can read and write files to Cloud Storage buckets from almost anywhere, so you can use buckets as common storage between your instances, app engines, your on-premises systems, and other cloud services.

How to store data in a Google Cloud Storage bucket?

Create a bucket

  • Go to the Google Cloud Platform dashboard and create a new project.
  • This is how the dashboard will look after creating the project My First Project.
  • Go to the Storage option under the Resources column. This will lead you to the next step which is creating a bucket for Cloud Storage.
  • The next step is choosing the Storage Location. GCP provides multiple options for the geographic placement of our data. Here we'll choose Region with the location asia-south2 (Delhi).
  • GCS offers four storage classes to choose from: Standard, Nearline, Coldline, and Archive. Here we'll go with the default Standard storage class.
  • The next step is to choose the access control for objects, i.e., who and what has access to the stored objects.
  • Now choose the security option and create a bucket.
  • This is how the storage bucket dashboard will look.

How will the stored data look?

How to download Data from GCS?

1. Using Console

  • In the Google Cloud Console, go to the Cloud Storage Browser page.

2. Using the gsutil command

  • Use the following command to download the object using the gsutil tool:
> gsutil cp gs://BUCKET_NAME/OBJECT_NAME SAVE_TO_LOCATION

3. Using APIs

  • Get an authorization access token from the OAuth 2.0 Playground.
OAuth 2.0 Playground
  • Use cURL to call the JSON API with a GET Object request:
> curl -X GET \
    -H "Authorization: Bearer OAUTH2_TOKEN" \
    -o "SAVE_TO_LOCATION" \
    "https://storage.googleapis.com/storage/v1/b/BUCKET_NAME/o/OBJECT_NAME?alt=media"
  • Here,
    - OAUTH2_TOKEN is the access token we generated in Step 1.
    - SAVE_TO_LOCATION is the path to the location where we want to save our object.
    - BUCKET_NAME is the name of the bucket containing the object we are downloading.
    - OBJECT_NAME is the name of the object we are downloading.
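The GCS JSON API's GET Object endpoint requires the object name to be URL-encoded (including any `/` separators). A small sketch, with hypothetical bucket and object names, that builds such a request URL in Python:

```python
from urllib.parse import quote

# Hypothetical values standing in for the placeholders above
bucket_name = "insaid_e2e"
object_name = "Data/Mask/0066.jpg"

# safe='' forces '/' in the object name to be encoded as %2F,
# as the JSON API expects for the {object} path segment
url = (
    "https://storage.googleapis.com/storage/v1/b/"
    f"{bucket_name}/o/{quote(object_name, safe='')}?alt=media"
)
print(url)
```

Passing this URL to curl (with the Authorization header) is equivalent to filling in the placeholders by hand.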

Importing Images Dataset from GCS

1. Using the gsutil command

The following command uses gsutil to retrieve data from GCS:

> !gsutil cp -r gs://insaid_e2e/  /content/mask_data
Copying gs://insaid_e2e/
\ [1 files][ 7.9 MiB/ 7.9 MiB]
Operation completed over 1 objects/7.9 MiB.

2. Using Get API Request

  • Get the data from the GCS using wget:
> !wget
Resolving (,,, ...
Connecting to (||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8300590 (7.9M) [application/x-zip-compressed]
Saving to: ''
100%[===================>] 7.92M 3.64MB/s in 2.2s
2021-11-21 14:39:06 (3.64 MB/s) - '' saved [8300590/8300590]
  • Unzip the downloaded dataset:
> !unzip /content/
Archive: /content/
inflating: Data/Non Mask/112.jpg
inflating: Data/Non Mask/123.jpg
inflating: Data/Mask/0121.png
inflating: Data/Mask/0097.png
inflating: Data/Mask/0116.png
  • Display any image for reference:
> image_mp= mpimg.imread(r'/content/Data/Mask/0022.jpg')
> imgplot=plt.imshow(image_mp)
The output of Sample Image (Mask/0022.jpg)
  • And voila! We have successfully extracted the images and displayed them in our notebook.

What’s Next?

In the next article of this series, we will see how to preprocess these images.

Follow us for more upcoming articles related to Data Science, Machine Learning, and Artificial Intelligence.

Also, do give us a Clap👏 if you find this article useful, as your encouragement catalyzes inspiration and helps us create more cool stuff like this.

Visit us on

One of India’s leading institutions providing world-class Data Science & AI programs for working professionals with a mission to groom Data leaders of tomorrow!