introduce ZyteAPITextResponse and ZyteAPIResponse to store raw Zyte Data API Response #10
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -33,8 +33,8 @@ Installation | |
|
|
||
| This package requires Python 3.7+. | ||
|
|
||
| How to configure | ||
| ---------------- | ||
| Configuration | ||
| ------------- | ||
|
|
||
| Replace the default ``http`` and ``https`` in Scrapy's | ||
| `DOWNLOAD_HANDLERS <https://docs.scrapy.org/en/latest/topics/settings.html#std-setting-DOWNLOAD_HANDLERS>`_ | ||
|
|
@@ -46,7 +46,7 @@ Lastly, make sure to `install the asyncio-based Twisted reactor | |
| <https://docs.scrapy.org/en/latest/topics/asyncio.html#installing-the-asyncio-reactor)>`_ | ||
| in the ``settings.py`` file as well: | ||
|
|
||
| Here's example of the things needed inside a Scrapy project's ``settings.py`` file: | ||
| Here's an example of the things needed inside a Scrapy project's ``settings.py`` file: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
|
|
@@ -60,37 +60,75 @@ Here's example of the things needed inside a Scrapy project's ``settings.py`` fi | |
|
|
||
| TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor" | ||
|
|
||
| How to use | ||
| ---------- | ||
| Usage | ||
| ----- | ||
|
|
||
| Set the ``zyte_api`` `Request.meta | ||
| <https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta>`_ | ||
| key to download a request using Zyte API. Full list of parameters is provided in the | ||
| `Zyte API Specification <https://docs.zyte.com/zyte-api/openapi.html#zyte-openapi-spec>`_. | ||
| To enable every request to be sent through Zyte API, you can set the following | ||
|
Member
I'm not sure the setting below makes every request go through Zyte API. If I'm reading the code correctly, it first checks whether the zyte_api meta key is present and its value is true-ish, and uses ZYTE_API_DEFAULT_PARAMS only in that case. So, parameters are only applied to requests which are already marked explicitly as Zyte API requests. That's actually the behavior I'm fine with :) It looks useful e.g. to set a default geolocation for Zyte API requests made from a spider. On the other hand, making every request go through Zyte API by using this option looks quite problematic; it'd require more thought. For example,
It seems some feature to "enable Zyte API transparently" makes sense after we put a bit more effort into "transparent" integration of scrapy Requests with Zyte API direct downloader, it doesn't look as straightforward as setting default parameters.
Member
Author
That's right. #13 was not meant to fully change the behavior. It only aimed to keep users from repeating the same Zyte API parameters in different parts of the code. In any case, hopefully it can serve as a starting point for enabling all requests to go through Zyte API. :)
That's a good point. This also extends to other things like downloading an image, file, etc. I think one option is to prevent This could be a bit tricky since Zyte API offers a lot of features that don't only cater to being a simple proxy. I think for users to have a seamless experience when using Zyte API, we should have an interface that could cover general Scrapy use. This could mean only downloading the We can open up another PR to explore the different options here.
I think this could be alleviated somehow by the proposed interface in #20.
Contributor
I think @kmike’s main point was that the documentation needs an update, as it is currently quite misleading. The other points are about issues if we decided to make the implementation match the current documentation, which I don’t think we want, at least not at this point.
Member
Author
I see, thanks for clarifying! What do you think about this doc update? 2adc8a6
Contributor
The documentation change looks great. I do wonder if we should cause
Member
Author
Hmm a good point. There could be a few cases for this:

We could use

    meta = {"zyte_api": {}}  # default
    if does_it_look_like_we_need_javascript():
        meta["zyte_api"].update({"javascript": True})
    if how_about_using_fr_region():
        meta["zyte_api"].update({"geolocation": "FR"})
    yield scrapy.Request(url, self.callback_func, meta=meta)

To summarize, we could use Any thoughts on this?
Contributor
+1 to having
Member
This seems to be the only remaining fix to make, the PR looks good to me otherwise 👍
Member
Author
Refactored the code to accept
|
||
| in the ``settings.py`` file or `any other settings within Scrapy | ||
| <https://docs.scrapy.org/en/latest/topics/settings.html#populating-the-settings>`_: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| import scrapy | ||
|
|
||
| ZYTE_API_DEFAULT_PARAMS = { | ||
| "browserHtml": True, | ||
| "geolocation": "US", | ||
| } | ||
|
|
||
| class TestSpider(scrapy.Spider): | ||
| name = "test" | ||
| You can see the full list of parameters in the `Zyte API Specification | ||
| <https://docs.zyte.com/zyte-api/openapi.html#zyte-openapi-spec>`_. | ||
|
|
||
| def start_requests(self): | ||
| You can also control these parameters on a per-request basis by setting the | ||
| ``zyte_api`` key in `Request.meta <https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta>`_. | ||
| Doing so overrides any parameters that were set in the | ||
| ``ZYTE_API_DEFAULT_PARAMS`` setting. | ||
|
|
||
| yield scrapy.Request( | ||
| url="http://books.toscrape.com/", | ||
| callback=self.parse, | ||
| meta={ | ||
| "zyte_api": { | ||
| "browserHtml": True, | ||
| # You can set any GEOLocation region you want. | ||
| "geolocation": "US", | ||
| "javascript": True, | ||
| "echoData": {"something": True}, | ||
| } | ||
| }, | ||
| ) | ||
| .. code-block:: python | ||
|
|
||
| def parse(self, response): | ||
| yield {"URL": response.url, "status": response.status, "HTML": response.body} | ||
| import scrapy | ||
|
|
||
|
|
||
| class SampleQuotesSpider(scrapy.Spider): | ||
| name = "sample_quotes" | ||
|
|
||
| def start_requests(self): | ||
|
|
||
| yield scrapy.Request( | ||
| url="https://quotes.toscrape.com/", | ||
| callback=self.parse, | ||
| meta={ | ||
| "zyte_api": { | ||
| "browserHtml": True, | ||
| "geolocation": "US", # You can set any Geolocation region you want. | ||
| "javascript": True, | ||
| "echoData": {"some_value_I_could_track": 123}, | ||
| } | ||
| }, | ||
| ) | ||
|
|
||
| def parse(self, response): | ||
| yield {"URL": response.url, "status": response.status, "HTML": response.body} | ||
|
|
||
| print(response.zyte_api) | ||
| # { | ||
| # 'url': 'https://quotes.toscrape.com/', | ||
| # 'browserHtml': '<html> ... </html>', | ||
| # 'echoData': {'some_value_I_could_track': 123}, | ||
| # } | ||
|
|
||
| print(response.request.meta) | ||
| # { | ||
| # 'zyte_api': { | ||
| # 'browserHtml': True, | ||
| # 'geolocation': 'US', | ||
| # 'javascript': True, | ||
| # 'echoData': {'some_value_I_could_track': 123} | ||
| # }, | ||
| # 'download_timeout': 180.0, | ||
| # 'download_slot': 'quotes.toscrape.com' | ||
| # } | ||
|
|
||
| The raw Zyte API response can be accessed via the ``zyte_api`` attribute | ||
| of the response object. Such responses are instances of ``ZyteAPIResponse`` or | ||
| ``ZyteAPITextResponse``, which are subclasses of ``scrapy.http.Response`` | ||
| and ``scrapy.http.TextResponse`` respectively. These classes are needed to hold the raw Zyte API | ||
| responses. | ||
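The review thread above describes how the defaults interact with per-request parameters: ``ZYTE_API_DEFAULT_PARAMS`` is only consulted when the ``zyte_api`` meta value is true-ish, and per-request keys take precedence. The following is a minimal sketch of that merge as described in the discussion. ``resolve_zyte_api_params`` is a hypothetical helper, not a function from this PR, and the actual download handler may behave differently (in particular, an empty dict is treated as falsy here).

```python
from typing import Any, Dict, Optional

# Stand-in for the ZYTE_API_DEFAULT_PARAMS setting shown in the README diff.
ZYTE_API_DEFAULT_PARAMS: Dict[str, Any] = {
    "browserHtml": True,
    "geolocation": "US",
}


def resolve_zyte_api_params(
    meta: Dict[str, Any], defaults: Dict[str, Any]
) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: return the merged parameters for a request,
    or None when the request is not marked for Zyte API."""
    per_request = meta.get("zyte_api")
    if not per_request:
        # No true-ish ``zyte_api`` key: the defaults are NOT applied.
        return None
    merged = dict(defaults)  # copy so the setting itself is never mutated
    if isinstance(per_request, dict):
        merged.update(per_request)  # per-request values override defaults
    return merged


if __name__ == "__main__":
    print(resolve_zyte_api_params({}, ZYTE_API_DEFAULT_PARAMS))
    print(
        resolve_zyte_api_params(
            {"zyte_api": {"geolocation": "FR"}}, ZYTE_API_DEFAULT_PARAMS
        )
    )
```

Under this reading, a plain Scrapy request without the meta key is left untouched, which matches the behavior the reviewers said they were comfortable with.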
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,123 @@ | ||
| from base64 import b64decode | ||
| from typing import Dict, List, Optional, Union | ||
|
|
||
| from scrapy import Request | ||
| from scrapy.http import Response, TextResponse | ||
| from scrapy.responsetypes import responsetypes | ||
|
|
||
| _DEFAULT_ENCODING = "utf-8" | ||
|
|
||
|
|
||
| class ZyteAPIMixin: | ||
|
|
||
|
|
||
| REMOVE_HEADERS = { | ||
| # Zyte API already decompresses the HTTP Response Body. Scrapy's | ||
| # HttpCompressionMiddleware will error out when it attempts to | ||
| # decompress an already decompressed body based on this header. | ||
| "content-encoding" | ||
| } | ||
|
|
||
| def __init__(self, *args, zyte_api: Optional[Dict] = None, **kwargs): | ||
| super().__init__(*args, **kwargs) | ||
| self._zyte_api = zyte_api | ||
|
|
||
| def replace(self, *args, **kwargs): | ||
| """Create a new response with the same attributes except for those given | ||
| new values. | ||
| """ | ||
| return super().replace(*args, **kwargs) | ||
|
|
||
| @property | ||
| def zyte_api(self) -> Optional[Dict]: | ||
|
|
||
| """Contains the raw API response from Zyte API. | ||
|
|
||
| To see the full list of parameters and their description, kindly refer to the | ||
| `Zyte API Specification <https://docs.zyte.com/zyte-api/openapi.html#zyte-openapi-spec>`_. | ||
| """ | ||
| return self._zyte_api | ||
|
|
||
| @classmethod | ||
| def _prepare_headers(cls, init_headers: Optional[List[Dict[str, str]]]): | ||
| if not init_headers: | ||
| return None | ||
| return { | ||
| h["name"]: h["value"] | ||
| for h in init_headers | ||
| if h["name"].lower() not in cls.REMOVE_HEADERS | ||
| } | ||
|
|
||
|
|
||
| class ZyteAPITextResponse(ZyteAPIMixin, TextResponse): | ||
| @classmethod | ||
| def from_api_response(cls, api_response: Dict, *, request: Optional[Request] = None): | ||
| """Alternative constructor to instantiate the response from the raw | ||
| Zyte API response. | ||
| """ | ||
| body = None | ||
| encoding = None | ||
|
|
||
| if api_response.get("browserHtml"): | ||
| encoding = _DEFAULT_ENCODING # Zyte API has "utf-8" by default | ||
| body = api_response["browserHtml"].encode(encoding) | ||
| elif api_response.get("httpResponseBody"): | ||
| body = b64decode(api_response["httpResponseBody"]) | ||
|
|
||
| return cls( | ||
| url=api_response["url"], | ||
| status=200, | ||
|
|
||
| body=body, | ||
| encoding=encoding, | ||
| request=request, | ||
| flags=["zyte-api"], | ||
| headers=cls._prepare_headers(api_response.get("httpResponseHeaders")), | ||
| zyte_api=api_response, | ||
| ) | ||
|
|
||
|
|
||
| class ZyteAPIResponse(ZyteAPIMixin, Response): | ||
| @classmethod | ||
| def from_api_response(cls, api_response: Dict, *, request: Optional[Request] = None): | ||
| """Alternative constructor to instantiate the response from the raw | ||
| Zyte API response. | ||
| """ | ||
| return cls( | ||
| url=api_response["url"], | ||
| status=200, | ||
| body=b64decode(api_response.get("httpResponseBody") or ""), | ||
| request=request, | ||
| flags=["zyte-api"], | ||
| headers=cls._prepare_headers(api_response.get("httpResponseHeaders")), | ||
| zyte_api=api_response, | ||
| ) | ||
|
|
||
|
|
||
| def process_response( | ||
|
|
||
| api_response: Dict[str, Union[List[Dict], str]], request: Request | ||
| ) -> Optional[Union[ZyteAPITextResponse, ZyteAPIResponse]]: | ||
| """Given a Zyte API response and the ``scrapy.Request`` that asked for it, | ||
| return either a ``ZyteAPITextResponse`` or a ``ZyteAPIResponse``, depending | ||
| on whether ``browserHtml`` is available or the HTTP body can be decoded as text. | ||
| """ | ||
|
|
||
| # NOTES: Currently, Zyte API does NOT allow both 'browserHtml' and | ||
| # 'httpResponseBody' to be present at the same time. The support for both | ||
| # will be addressed in the future. Reference: | ||
| # - https://github.com/scrapy-plugins/scrapy-zyte-api/pull/10#issuecomment-1131406460 | ||
| # For now, at least one of them should be present. | ||
|
|
||
| if api_response.get("browserHtml"): | ||
| # Using TextResponse because browserHtml always returns a browser-rendered page | ||
| # even when requesting files (like images) | ||
| return ZyteAPITextResponse.from_api_response(api_response, request=request) | ||
|
|
||
| if api_response.get("httpResponseHeaders") and api_response.get("httpResponseBody"): | ||
| response_cls = responsetypes.from_args( | ||
| headers=api_response["httpResponseHeaders"], | ||
| url=api_response["url"], | ||
| # FIXME: update this when python-zyte-api supports base64 decoding | ||
| body=b64decode(api_response["httpResponseBody"]), # type: ignore | ||
| ) | ||
| if issubclass(response_cls, TextResponse): | ||
| return ZyteAPITextResponse.from_api_response(api_response, request=request) | ||
|
|
||
| return ZyteAPIResponse.from_api_response(api_response, request=request) | ||
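The header filtering and body selection in the module above can be tried without Scrapy installed. The sketch below restates that logic as two standalone functions. ``prepare_headers`` and ``extract_body`` are illustrative names rather than part of the PR, and the sample payload is made up for the demonstration.

```python
from base64 import b64decode, b64encode
from typing import Dict, List, Optional, Tuple

# Mirrors ZyteAPIMixin.REMOVE_HEADERS: Zyte API already decompresses the body,
# so the original Content-Encoding header must not reach Scrapy.
REMOVE_HEADERS = {"content-encoding"}


def prepare_headers(
    init_headers: Optional[List[Dict[str, str]]]
) -> Optional[Dict[str, str]]:
    """Turn the API's list of name/value pairs into a plain dict,
    dropping headers listed in REMOVE_HEADERS (case-insensitive)."""
    if not init_headers:
        return None
    return {
        h["name"]: h["value"]
        for h in init_headers
        if h["name"].lower() not in REMOVE_HEADERS
    }


def extract_body(api_response: Dict) -> Tuple[Optional[bytes], Optional[str]]:
    """Return (body, encoding), preferring browserHtml over httpResponseBody."""
    if api_response.get("browserHtml"):
        # browserHtml arrives as text; encode it as UTF-8 bytes.
        return api_response["browserHtml"].encode("utf-8"), "utf-8"
    if api_response.get("httpResponseBody"):
        # httpResponseBody arrives base64-encoded; encoding is left undetermined.
        return b64decode(api_response["httpResponseBody"]), None
    return None, None


if __name__ == "__main__":
    # Made-up payload shaped like the fields the module reads.
    sample = {
        "url": "https://example.com",
        "httpResponseBody": b64encode(b"hello").decode("ascii"),
        "httpResponseHeaders": [
            {"name": "Content-Type", "value": "text/html"},
            {"name": "Content-Encoding", "value": "gzip"},
        ],
    }
    print(prepare_headers(sample["httpResponseHeaders"]))
    print(extract_body(sample))
```

Leaving the encoding as ``None`` for ``httpResponseBody`` matches the module's approach of letting ``TextResponse`` infer it from the headers and body.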