{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":88059195,"defaultBranch":"main","name":"cc-pyspark","ownerLogin":"commoncrawl","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2017-04-12T14:09:44.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/1194841?v=4","public":true,"private":false,"isOrgOwned":true},"refInfo":{"name":"","listCacheKey":"v0:1722441324.0","currentOid":""},"activityList":{"items":[{"before":"987709ce64266fc233c567f093d8f93342390f51","after":"cd64b143f363878119d9ec26e7a46b9e13097f25","ref":"refs/heads/sparkccfile","pushedAt":"2024-09-11T02:33:46.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"jt55401","name":"Jason Grey","path":"/jt55401","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1494409?s=80&v=4"},"commit":{"message":"fix bug when the file was not downloaded from s3","shortMessageHtmlLink":"fix bug when the file was not downloaded from s3"}},{"before":"ee644a58880ed7c3ae557d110f7865029d485227","after":"987709ce64266fc233c567f093d8f93342390f51","ref":"refs/heads/sparkccfile","pushedAt":"2024-09-10T00:46:12.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"jt55401","name":"Jason Grey","path":"/jt55401","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1494409?s=80&v=4"},"commit":{"message":"fix s3 functions so they work in spark environment","shortMessageHtmlLink":"fix s3 functions so they work in spark environment"}},{"before":"d384ecdc0d5513667dbb86267368839b6974ded8","after":"ee644a58880ed7c3ae557d110f7865029d485227","ref":"refs/heads/sparkccfile","pushedAt":"2024-08-03T02:35:26.000Z","pushType":"push","commitsCount":2,"pusher":{"login":"jt55401","name":"Jason Grey","path":"/jt55401","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1494409?s=80&v=4"},"commit":{"message":"Add CCFileProcessorSparkJob example and link in readme for it.","shortMessageHtmlLink":"Add CCFileProcessorSparkJob example and link in readme for it."}},{"before":null,"after":"d384ecdc0d5513667dbb86267368839b6974ded8","ref":"refs/heads/sparkccfile","pushedAt":"2024-07-31T15:55:24.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"jt55401","name":"Jason Grey","path":"/jt55401","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1494409?s=80&v=4"},"commit":{"message":"Add sparkccfile.py to support file-wise processing in spark jobs (used in integrity job)","shortMessageHtmlLink":"Add sparkccfile.py to support file-wise processing in spark jobs (use…"}},{"before":"ed7b41f4e61741e2867c35d5a16895377875b02e","after":"1d5980ab312bdff4218475e6da40022c1f22a2fd","ref":"refs/heads/main","pushedAt":"2024-04-08T15:41:52.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"sebastian-nagel","name":"Sebastian Nagel","path":"/sebastian-nagel","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1630582?s=80&v=4"},"commit":{"message":"docs: querying columnar index requires S3 access\n\nReview the sections related to data access schemes and the columnar\nindex. Emphasize that querying the columnar index requires S3 access\nand is not possible using HTTP/HTTPS access.","shortMessageHtmlLink":"docs: querying columnar index requires S3 access"}},{"before":"69ccb6149e3be52352eeb238bebe87d3a642cedd","after":"cf621165817744fdb8d07a4830e12e1ca3fbf99c","ref":"refs/heads/readme-columnar-index-no-https-access","pushedAt":"2024-04-01T14:12:46.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"sebastian-nagel","name":"Sebastian Nagel","path":"/sebastian-nagel","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1630582?s=80&v=4"},"commit":{"message":"docs: querying columnar index requires S3 access\n\nReview the sections related to data access schemes and the columnar\nindex. Emphasize that querying the columnar index requires S3 access\nand is not possible using HTTP/HTTPS access.","shortMessageHtmlLink":"docs: querying columnar index requires S3 access"}},{"before":"2444b7b40cb65e534d012658c594b8216798f2a3","after":"69ccb6149e3be52352eeb238bebe87d3a642cedd","ref":"refs/heads/readme-columnar-index-no-https-access","pushedAt":"2024-04-01T14:10:17.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"sebastian-nagel","name":"Sebastian Nagel","path":"/sebastian-nagel","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1630582?s=80&v=4"},"commit":{"message":"docs: querying columnar index requires S3 access\n\nReview the sections related to data access schemes and the columnar\nindex. Emphasize that querying the columnar index requires S3 access\nand is not possible using HTTP/HTTPS access.","shortMessageHtmlLink":"docs: querying columnar index requires S3 access"}},{"before":null,"after":"2444b7b40cb65e534d012658c594b8216798f2a3","ref":"refs/heads/readme-columnar-index-no-https-access","pushedAt":"2024-04-01T14:00:22.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"sebastian-nagel","name":"Sebastian Nagel","path":"/sebastian-nagel","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1630582?s=80&v=4"},"commit":{"message":"fix(documentation): querying columnar index requires S3 access\n\nReview the sections related to data access schemes and the columnar\nindex. Emphasize that querying the columnar index requires S3 access\nand is not possible using HTTP/HTTPS access.","shortMessageHtmlLink":"fix(documentation): querying columnar index requires S3 access"}},{"before":"f72f9059849a5c1524e6ed3d28657e1a1d4eb64d","after":"ed7b41f4e61741e2867c35d5a16895377875b02e","ref":"refs/heads/main","pushedAt":"2023-03-16T13:33:08.327Z","pushType":"pr_merge","commitsCount":6,"pusher":{"login":"sebastian-nagel","name":"Sebastian Nagel","path":"/sebastian-nagel","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1630582?s=80&v=4"},"commit":{"message":"Merge pull request #38 from commoncrawl/37-fastwarc\n\nProvide classes to use FastWARC to read WARC/WAT/WET files, resolves #37","shortMessageHtmlLink":"Merge pull request #38 from commoncrawl/37-fastwarc"}},{"before":"71d4b8c3936d88b6ed0bed2c97af66dcaf28349b","after":"27ecebad6d331460211db40aab2a832c4a503cc1","ref":"refs/heads/37-fastwarc","pushedAt":"2023-03-16T13:25:21.460Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"sebastian-nagel","name":"Sebastian Nagel","path":"/sebastian-nagel","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1630582?s=80&v=4"},"commit":{"message":"Support usage of FastWARC for WARC file parsing\n- update README","shortMessageHtmlLink":"Support usage of FastWARC for WARC file parsing"}},{"before":"54918e85cf87d47e1f7278965ac04a0fc8e414a0","after":"f72f9059849a5c1524e6ed3d28657e1a1d4eb64d","ref":"refs/heads/main","pushedAt":"2023-03-16T13:20:44.712Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"sebastian-nagel","name":"Sebastian Nagel","path":"/sebastian-nagel","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1630582?s=80&v=4"},"commit":{"message":"Drop support for Python 2.7, fixes #40\n- update README\n- drop support for Python 2.x module urlparse (replaced by\n urllib.parse)","shortMessageHtmlLink":"Drop support for Python 2.7, fixes #40"}},{"before":null,"after":"6f57d518941464a85a1d6f433ab5e821b0474433","ref":"refs/heads/40-drop-python-2","pushedAt":"2023-03-16T12:51:24.902Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"sebastian-nagel","name":"Sebastian Nagel","path":"/sebastian-nagel","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1630582?s=80&v=4"},"commit":{"message":"Drop support for Python 2.7, fixes #40\n- update README\n- drop support for Python 2.x module urlparse (replaced by\n urllib.parse)","shortMessageHtmlLink":"Drop support for Python 2.7, fixes #40"}},{"before":"6440df268a869c16983cfa19bdf6a66c23a36262","after":"71d4b8c3936d88b6ed0bed2c97af66dcaf28349b","ref":"refs/heads/37-fastwarc","pushedAt":"2023-03-16T11:56:48.625Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"sebastian-nagel","name":"Sebastian Nagel","path":"/sebastian-nagel","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1630582?s=80&v=4"},"commit":{"message":"Support usage of FastWARC for WARC file parsing\n- update README","shortMessageHtmlLink":"Support usage of FastWARC for WARC file parsing"}},{"before":null,"after":"54918e85cf87d47e1f7278965ac04a0fc8e414a0","ref":"refs/heads/python-2.7","pushedAt":"2023-03-16T11:04:06.848Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"sebastian-nagel","name":"Sebastian Nagel","path":"/sebastian-nagel","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1630582?s=80&v=4"},"commit":{"message":"Webgraph construction (cc-main-2022-may-jun-aug):\n- avoid multiple extraction of host names from source and base URLs\n - implement method get_links(...) in class ExtractHostLinksJob\n - pass extracted source and base host names to method yield_links(...)\n- update IANA TLD list\n- consistent naming of source nodes (src_url, src_host instead of\n from_url, from_host)","shortMessageHtmlLink":"Webgraph construction (cc-main-2022-may-jun-aug):"}},{"before":"35dabecd784b6fbac0dbc6bedc8e81e17557a4f2","after":"6440df268a869c16983cfa19bdf6a66c23a36262","ref":"refs/heads/37-fastwarc","pushedAt":"2023-03-16T11:03:16.648Z","pushType":"push","commitsCount":2,"pusher":{"login":"sebastian-nagel","name":"Sebastian Nagel","path":"/sebastian-nagel","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1630582?s=80&v=4"},"commit":{"message":"Support usage of FastWARC for WARC file parsing\n- update README","shortMessageHtmlLink":"Support usage of FastWARC for WARC file parsing"}},{"before":null,"after":"21da5843aa9b41319503a78769cc55536dabb6cc","ref":"refs/heads/simdjson","pushedAt":"2023-03-07T14:03:26.947Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"sebastian-nagel","name":"Sebastian Nagel","path":"/sebastian-nagel","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1630582?s=80&v=4"},"commit":{"message":"Trial to use (py)simdjson\n- https://simdjson.org/\n- https://pysimdjson.tkte.ch/index.html\n\nTODO:\n- a first benchmark showed that the differences to ujson are marginal\n and there is no clear speed up visible at all","shortMessageHtmlLink":"Trial to use (py)simdjson"}},{"before":"d8f46986edf713ec8da2f836a7764d00135971ef","after":"35dabecd784b6fbac0dbc6bedc8e81e17557a4f2","ref":"refs/heads/37-fastwarc","pushedAt":"2023-03-07T14:03:18.035Z","pushType":"push","commitsCount":1,"pusher":{"login":"sebastian-nagel","name":"Sebastian Nagel","path":"/sebastian-nagel","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1630582?s=80&v=4"},"commit":{"message":"Support usage of FastWARC for WARC file parsing\n- provide methods for encapsulation to hide differences between warcio\n and fastwarc from user methods\n- simplify fastwarc classes and avoid code duplication by using\n encapsulated methods to access WARC/HTTP headers and the payload stream","shortMessageHtmlLink":"Support usage of FastWARC for WARC file parsing"}}],"hasNextPage":false,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"Y3Vyc29yOnYyOpK7MjAyNC0wOS0xMVQwMjozMzo0Ni4wMDAwMDBazwAAAASysrk6","startCursor":"Y3Vyc29yOnYyOpK7MjAyNC0wOS0xMVQwMjozMzo0Ni4wMDAwMDBazwAAAASysrk6","endCursor":"Y3Vyc29yOnYyOpK7MjAyMy0wMy0wN1QxNDowMzoxOC4wMzU4OTdazwAAAAL-PZZF"}},"title":"Activity · commoncrawl/cc-pyspark"}