Just a quick note for anyone who lands here struggling with Scrapy’s built-in S3 feed storage and wants the feed/file that Scrapy creates uploaded to S3 as a gzipped file.

There are a few methods floating around on the internet for gzipping the feed (JSON or JSON Lines) and then uploading it to S3 for further processing. The thing is, none of them worked for me, so I decided to see if I could fix it myself.

I’m using Scrapy 2.0.1 at the time of writing and tested this on my local Mac and on AWS Lambda, where I run Scrapy.

We’re going to add a custom feed storage class that replaces the default S3 storage.

Step 1: create (or update) extensions.py and add the code below.

extensions.py should live at the same level as pipelines.py/middlewares.py.
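For reference, a typical project layout would look roughly like this (the project name “myproject” is just a placeholder):

myproject/
    scrapy.cfg
    myproject/
        __init__.py
        settings.py
        pipelines.py
        middlewares.py
        extensions.py   # <-- new file goes here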

from tempfile import NamedTemporaryFile
from scrapy.extensions.feedexport import S3FeedStorage
import gzip
import logging
import shutil
import boto3

logger = logging.getLogger(__name__)

class CustomS3FeedStorage(S3FeedStorage):
    def _store_in_thread(self, file):
        if file.tell() != 0:
            file.seek(0)
            # create a temporary file to hold the gzipped feed
            tmpf = NamedTemporaryFile(prefix='gzip-')

            # re-open the feed file and compress it into the temporary file
            with open(file.name, 'rb') as f_in:
                with gzip.open(tmpf.name, 'wb') as f_out:
                    shutil.copyfileobj(f_in, f_out)

            # upload the gzipped file to the bucket/key parsed from FEED_URI
            s3 = boto3.client('s3')
            s3.upload_file(tmpf.name, self.bucketname, self.keyname)
        else:
            logger.warning('File is empty, not uploaded to S3')
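
If you want to sanity-check the result afterwards, here’s a quick sketch that downloads the object and decompresses it with gzip. The bucket and key names are just placeholders for whatever you configured:

import gzip
import boto3

s3 = boto3.client('s3')
# 'my-bucket' and 'exports/items.jsonl.gz' are placeholders for your own bucket/key
obj = s3.get_object(Bucket='my-bucket', Key='exports/items.jsonl.gz')
data = gzip.decompress(obj['Body'].read())
print(data[:200])  # first few bytes of the decompressed feed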

Step 2: open your settings.py and add the below. Make sure you replace “<project>” with your project name!

FEED_STORAGES = {
    's3': '<project>.extensions.CustomS3FeedStorage',
}

And you’re done. If you now run your spider, it should upload the feed as a gzip file. Note that you have to add .gz to the end of your FEED_URI, otherwise the key on S3 will keep whatever extension you have configured now.
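
For example, a FEED_URI along these lines (the bucket name and path are just placeholders) would produce a gzipped JSON Lines feed on S3:

FEED_FORMAT = 'jsonlines'
FEED_URI = 's3://my-bucket/feeds/%(name)s/%(time)s.jsonl.gz'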