使用curl下载一个jpg文件碰到的问题

 
  • 最近在看python的scrapy,当然也顺便复习了下requests和selenium, 有趣的是今天忽然看到用pillow 处理图像的问题, pillow 作为PIL 在python3的继承者,拥有处理图像的功能. Document 在这里:

      https://pillow.readthedocs.io/en/5.0.0/
    
  • 当然今天并不是说pillow,而是我在使用pillow的过程中需要从网上download 一些图片文件,我从image.baidu.com 搜索kitten的来找到了一个没有版权字样的照片地址如下:

      https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1516187273001&di=878dce25579dac6338524319d784e880&imgtype=0&src=http%3A%2F%2Fi2.hdslb.com%2Fbfs%2Farchive%2Fa63066020082f2d69c45140b66718cc8cb37f8c4.jpg
    
  • 使用我常用的curl命令来下载:

      curl -o kitten.jpg -k -L https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1516187273001&di=878dce25579dac6338524319d784e880&imgtype=0&src=http%3A%2F%2Fi2.hdslb.com%2Fbfs%2Farchive%2Fa63066020082f2d69c45140b66718cc8cb37f8c4.jpg
    
  • 然后使用python代码来打开:

      def show_jpg():
          kitten=Image.open('kitten.jpg')
          kitten.show()
      show_jpg()
    
  • 在我的kali linux 下,会使用ImageMagick来打开这个图: 但是我发现这不是我想要下载的图啊,哪里出问题了: 我记得curl 有个-v (–verbose)的 选项,来看看

      * SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
      * ALPN, server accepted to use http/1.1
      * Server certificate:
      * subject: C=CN; ST=beijing; L=beijing; OU=service operation department; O=Beijing Baidu Netcom Science Technology Co., Ltd; CN=baidu.com
      * start date: Jun 29 05:32:07 2017 GMT
      * expire date: Apr 25 05:31:02 2018 GMT
      * issuer: C=BE; O=GlobalSign nv-sa; CN=GlobalSign Organization Validation CA - SHA256 - G2
      * SSL certificate verify ok.
      } [5 bytes data]
      > GET /timg?image HTTP/1.1
      > Host: timgsa.baidu.com
      > User-Agent: curl/7.55.1
      > Accept: */*
      >
    
  • 略去上面SSL的那部分,看GET /timg?image ,明白了什么, &在命令结尾会使shell把这fork出的进程放在后台运行. 仔细看看url里面有多个&,需要用引号把 url 给引住。 在来看看正确的命令和输出.

      curl -o kitten2.jpg -k -v -L "https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1516187273001&di=878dce25579dac6338524319d784e880&imgtype=0&src=http%3A%2F%2Fi2.hdslb.com%2Fbfs%2Farchive%2Fa63066020082f2d69c45140b66718cc8cb37f8c4.jpg"
    
  • 输出:

      * SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
      * ALPN, server accepted to use http/1.1
      * Server certificate:
      * subject: C=CN; ST=beijing; L=beijing; OU=service operation department; O=Beijing Baidu Netcom Science Technology Co., Ltd; CN=baidu.com
      * start date: Jun 29 05:32:07 2017 GMT
      * expire date: Apr 25 05:31:02 2018 GMT
      * issuer: C=BE; O=GlobalSign nv-sa; CN=GlobalSign Organization Validation CA - SHA256 - G2
      * SSL certificate verify ok.
      } [5 bytes data]
      > GET /timg?image&quality=80&size=b9999_10000&sec=1516187273001&di=878dce25579dac6338524319d784e880&imgtype=0&src=http%3A%2F%2Fi2.hdslb.com%2Fbfs%2Farchive%2Fa63066020082f2d69c45140b66718cc8cb37f8c4.jpg HTTP/1.1
      > Host: timgsa.baidu.com
      > User-Agent: curl/7.55.1
      > Accept: */*
      >
    
  • GET后面这么长的地址都带过来了,这就对了。 比较这两次下载的文件,第二次下载的文件明显要大的多.

      file kitten.jpg
      kitten.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 72x72, segment length 16, baseline, precision 8, 200x200, frames 3
      file kitten2.jpg
      kitten2.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 96x96, segment length 16, baseline, precision 8, 720x450, frames 3
    
  • 今天真是碰到了个教训,忘记了url里面的&在shell里面需要escape.
    不过还是重新翻了下curl的-v (–versbose)这一参数
    If you think this option still doesn’t give you enough details, consider using –trace or –trace-ascii instead