datasources.linkedin.search_linkedin
Import scraped LinkedIn data
It's prohibitively difficult to scrape data from LinkedIn within 4CAT itself due to its aggressive rate limiting and login wall. Instead, import data collected elsewhere.
1""" 2Import scraped LinkedIn data 3 4It's prohibitively difficult to scrape data from LinkedIn within 4CAT itself 5due to its aggressive rate limiting and login wall. Instead, import data 6collected elsewhere. 7""" 8import datetime 9import time 10import re 11 12from backend.lib.search import Search 13from common.lib.item_mapping import MappedItem 14 15class SearchLinkedIn(Search): 16 """ 17 Import scraped LinkedIn data 18 """ 19 type = "linkedin-search" # job ID 20 category = "Search" # category 21 title = "Import scraped LinkedIn data" # title displayed in UI 22 description = "Import LinkedIn data collected with an external tool such as Zeeschuimer." # description displayed in UI 23 extension = "ndjson" # extension of result file, used internally and in UI 24 is_from_zeeschuimer = True 25 26 # not available as a processor for existing datasets 27 accepts = [None] 28 references = [ 29 "[Zeeschuimer browser extension](https://github.com/digitalmethodsinitiative/zeeschuimer)", 30 "[Worksheet: Capturing TikTok data with Zeeschuimer and 4CAT](https://tinyurl.com/nmrw-zeeschuimer-tiktok) (also explains general usage of Zeeschuimer)" 31 ] 32 33 def get_items(self, query): 34 """ 35 Run custom search 36 37 Not available for LinkedIn 38 """ 39 raise NotImplementedError("LinkedIn datasets can only be created by importing data from elsewhere") 40 41 @staticmethod 42 def map_item(item): 43 """ 44 Parse LinkedIn post in Voyager V2 format 45 46 'Voyager V2' seems to be how the format is referred to in the data 47 itself... 48 49 :param item: Data as received from LinkedIn 50 :return dict: Mapped item 51 """ 52 53 # annoyingly, posts don't come with a timestamp 54 # approximate it by using the time of collection and the "time ago" 55 # included with the post (e.g. 
'published 18h ago') 56 if not item.get("actor"): 57 return {} 58 59 if "__import_meta" in item: 60 time_collected = int(item["__import_meta"]["timestamp_collected"] / 1000) # milliseconds 61 else: 62 # best we got 63 time_collected = int(time.time()) 64 65 time_ago = item["actor"]["subDescription"]["text"] if item["actor"].get("subDescription") else "" 66 timestamp = int(time_collected - SearchLinkedIn.parse_time_ago(time_ago)) 67 68 # images are stored in some convoluted way 69 # there are multiple URLs for various thumbnails, use the one for the 70 # largest version of the image 71 images = [] 72 if item["content"] and "images" in item["content"]: 73 for image in item["content"]["images"]: 74 image_data = image["attributes"][0]["vectorImage"] 75 artifacts = sorted(image_data["artifacts"], key=lambda x: x["width"], reverse=True) 76 url = image_data["rootUrl"] + artifacts[0]["fileIdentifyingUrlPathSegment"] 77 images.append(url) 78 79 # or alternatively they are stored here: 80 if not images and item["content"] and item["content"].get("articleComponent") and item["content"]["articleComponent"].get("largeImage"): 81 image = item["content"]["articleComponent"]["largeImage"]["attributes"][0]["detailData"]["vectorImage"] 82 if not image and item["content"]["articleComponent"]["largeImage"]["attributes"][0]["imageUrl"]: 83 images.append(item["content"]["articleComponent"]["largeImage"]["attributes"][0]["imageUrl"]["url"]) 84 elif image and image.get("artifacts"): 85 images.append(image["rootUrl"] + image["artifacts"][0]["fileIdentifyingUrlPathSegment"]) 86 87 # video thumbnails are stored similarly as image data 88 video_thumb_url = "" 89 thumb_content = None 90 if item["content"] and "*videoPlayMetadata" in item["content"]: 91 thumb_content = item["content"]["*videoPlayMetadata"]["thumbnail"] 92 elif item["content"] and "linkedInVideoComponent" in item["content"] and item["content"]["linkedInVideoComponent"]: 93 thumb_content = item["content"]["linkedInVideoComponent"]["*videoPlayMetadata"]["thumbnail"] 94 elif item["content"] and "externalVideoComponent" in item["content"] and item["content"]["externalVideoComponent"]: 95 thumb_content = item["content"]["externalVideoComponent"]["*videoPlayMetadata"]["thumbnail"] 96 if thumb_content: 97 video_thumb_url = thumb_content["rootUrl"] + thumb_content["artifacts"][0]["fileIdentifyingUrlPathSegment"] 98 99 author = SearchLinkedIn.get_author(item) 100 101 # the ID is in the format 'urn:li:activity:6960882777168695296' 102 # retain the numerical part as the item ID for 4CAT 103 # sometimes posts seem to be combined, e.g.: 104 # urn:li:aggregate:(urn:li:activity:3966023054712791616,urn:li:activity:3965915018238312449) 105 # effectively both IDs seem to refer to the same post, so just take the 106 # first one 107 meta_urn = item.get("updateMetadata", {"urn": item.get("preDashEntityUrn")})["urn"] 108 urn = "urn:li:activity:" + meta_urn.split("urn:li:activity:")[1].split(",")[0].split(")")[0] 109 item_id = urn.split(":").pop() 110 111 # the way hashtags were stored changed at some point 112 hashtags = [] 113 if item["commentary"] and "attributes" in item["commentary"]["text"]: 114 hashtags = [tag["trackingUrn"].split(":").pop() for tag in item["commentary"]["text"].get("attributes", []) if tag["type"] == "HASHTAG"] 115 elif item["commentary"] and "attributesV2" in item["commentary"]["text"]: 116 hashtags = [tag["detailData"]["*hashtag"]["trackingUrn"].split(":").pop() for tag in item["commentary"]["text"].get("attributesV2", []) if "*hashtag" in 
tag["detailData"]] 117 118 # and mentions 119 # we're storing both usernames and full names 120 author_mentions = [] 121 author_name_mentions = [] 122 if item["commentary"] and "attributes" in item["commentary"]["text"]: 123 for mention in item["commentary"]["text"].get("attributes", {}): 124 if mention["type"] == "PROFILE_MENTION": 125 mention = mention["*miniProfile"] 126 author_mentions.append(mention["publicIdentifier"]) 127 author_name_mentions.append(" ".join([mention.get("firstName", ""), mention.get("lastName", "")])) 128 elif mention["type"] == "COMPANY_NAME": 129 mention = mention["*miniCompany"] 130 author_mentions.append(mention["universalName"]) 131 author_name_mentions.append(mention.get("name", "")) 132 133 # same for metrics 134 if "*totalSocialActivityCounts" in item["*socialDetail"]: 135 metrics = { 136 "comments": item["*socialDetail"]["*totalSocialActivityCounts"]["numComments"], 137 "shares": item["*socialDetail"]["*totalSocialActivityCounts"]["numShares"], 138 "reactions": item["*socialDetail"]["*totalSocialActivityCounts"]["numLikes"], 139 "reaction_like": 0, 140 "reaction_empathy": 0, 141 "reaction_praise": 0, 142 "reaction_entertainment": 0, 143 "reaction_appreciation": 0, 144 "reaction_interest": 0 145 } 146 # There's different kind of reaction metrics 147 for reaction_type in item["*socialDetail"]["*totalSocialActivityCounts"].get("reactionTypeCounts", []): 148 metrics["reaction_" + reaction_type["reactionType"].lower()] = reaction_type["count"] 149 150 else: 151 metrics = { 152 "comments": item["*socialDetail"]["comments"]["paging"]["total"], 153 "shares": item["*socialDetail"]["totalShares"], 154 "reactions": item["*socialDetail"]["likes"]["paging"]["total"] 155 } 156 157 # and links 158 link_url = "" 159 if item.get("content") and item["content"].get("navigationContext"): 160 link_url = item["content"]["navigationContext"].get("actionTarget", "") 161 elif item.get("content") and item["content"].get("articleComponent") and "navigationContext" in item["content"]["articleComponent"]: 162 link_url = item["content"]["articleComponent"]["navigationContext"].get("actionTarget", "") 163 164 return MappedItem({ 165 "id": item_id, 166 "thread_id": item_id, 167 "body": item["commentary"]["text"]["text"] if item["commentary"] else "", 168 "timestamp": datetime.datetime.utcfromtimestamp(timestamp).strftime("%Y-%m-%d %H:%M:%S"), 169 "timestamp_collected": datetime.datetime.utcfromtimestamp(time_collected).strftime("%Y-%m-%d %H:%M:%S"), 170 "timestamp_ago": time_ago.split("•")[0].strip(), 171 "is_promoted": "yes" if not re.findall(r"[0-9]", time_ago) else "no", 172 **{("author_" + k).replace("_username", ""): v for k, v in author.items()}, 173 "author_mentions": ",".join(author_mentions), 174 "author_name_mentions": ",".join(author_name_mentions), 175 "hashtags": ",".join(hashtags), 176 "image_urls": ",".join(images), 177 "video_thumb_url": video_thumb_url, 178 "post_url": "https://www.linkedin.com/feed/update/" + urn, 179 "link_url": link_url, 180 **metrics, 181 "inclusion_context": item["header"]["text"]["text"] if item.get("header") else "", 182 "unix_timestamp": timestamp, 183 "unix_timestamp_collected": time_collected 184 }) 185 186 @staticmethod 187 def get_author(post): 188 """ 189 Extract author information from post 190 191 This is a bit complicated because it works differently for companies 192 and users and some fields are not always present. Hence, a separate 193 method. 
194 195 :param dict post: Post data 196 :return dict: Author information 197 """ 198 author = { 199 "username": post["actor"]["navigationContext"]["actionTarget"].split("linkedin.com/").pop().split("?")[0], 200 "name": post["actor"]["name"]["text"], 201 "description": post["actor"].get("description", {}).get("text", "") if post["actor"].get("description") else "", 202 "pronouns": "", 203 "avatar_url": "", 204 "is_company": "no", 205 "url": post["actor"]["navigationContext"]["actionTarget"].split("?")[0], 206 } 207 208 # likewise for author avatars 209 if post["actor"]["name"].get("attributes"): 210 if "*miniProfile" in post["actor"]["name"]["attributes"][0]: 211 author_profile = post["actor"]["name"]["attributes"][0]["*miniProfile"] 212 if author_profile["picture"]: 213 avatar_artifacts = sorted(author_profile["picture"]["artifacts"], key=lambda x: x["width"], reverse=True) 214 author.update({"avatar_url": author_profile["picture"]["rootUrl"] + avatar_artifacts[0]["fileIdentifyingUrlPathSegment"]}) 215 216 if author_profile.get("customPronoun"): 217 author.update({"pronouns": author_profile.get("customPronoun")}) 218 elif author_profile.get("standardizedPronoun"): 219 author.update({"pronouns": author_profile.get("standardizedPronoun").lower()}) 220 221 elif "*miniCompany" in post["actor"]["name"]["attributes"][0]: 222 author_profile = post["actor"]["name"]["attributes"][0]["*miniCompany"] 223 avatar_artifacts = sorted(author_profile["logo"]["artifacts"], key=lambda x: x["width"], reverse=True) 224 225 author.update({"is_company": "yes"}) 226 author.update({"avatar_url": author_profile["logo"]["rootUrl"] + avatar_artifacts[0]["fileIdentifyingUrlPathSegment"]}) 227 228 if post["actor"]["name"].get("attributesV2"): 229 pronouns = post["actor"]["name"]["attributesV2"][0]["detailData"].get("*profileFullName", {}).get("pronoun") 230 if pronouns: 231 if pronouns.get("customPronoun"): 232 author.update({"pronouns": pronouns.get("customPronoun")}) 233 elif pronouns.get("standardizedPronoun"): 234 author.update({"pronouns": pronouns.get("standardizedPronoun")}) 235 236 avatar = post["actor"]["image"].get("attributes", [{}])[0].get("detailData", {}).get("nonEntityProfilePicture") 237 if avatar and avatar["vectorImage"]: 238 author.update({"avatar_url": avatar["vectorImage"]["rootUrl"] + avatar["vectorImage"]["artifacts"][0]["fileIdentifyingUrlPathSegment"]}) 239 240 return author 241 242 @staticmethod 243 def parse_time_ago(time_ago): 244 """ 245 Attempt to parse a timestamp for a post 246 247 LinkedIn doesn't give us the actual timestamp, only a relative 248 indicator like "18h ago". This is annoying because it gets more 249 imprecise the longer ago it is, and because it is language-sensitive. 250 For example, in English 18 months is displayed as "18mo" but in Dutch 251 it is "18 mnd". 252 253 Right now this will only adjust the 'collected at' timestamp if the 254 data was scraped from an English or Dutch interface, and even then the 255 timestamps will still be imprecise. 256 257 :param str time_ago: Relative timestamp, e.g. '18mo'. 
258 :return int: Estimated timestamp of post, as unix timestamp 259 """ 260 time_ago = time_ago.split("•")[0] 261 numbers = re.sub(r"[^0-9]", "", time_ago).strip() 262 letters = re.sub(r"[0-9]", "", time_ago).strip() 263 264 period_lengths = { 265 "s": 1, 266 "m": 60, 267 "h": 3600, 268 "d": 86400, 269 "w": 7 * 86400, 270 "mo": 30.4375 * 86400, # we don't know WHICH months, so use the average length of a month 271 "mnd": 30.4375 * 86400, 272 "yr": 365.25 * 86400, # likewise 273 "j": 365.25 * 86400, 274 } 275 276 numbers = int(numbers) if len(numbers) else 0 277 return period_lengths.get(letters, 0) * numbers
class SearchLinkedIn(Search):
Import scraped LinkedIn data
def get_items(self, query):
Run custom search
Not available for LinkedIn
@staticmethod
def map_item(item):
Parse LinkedIn post in Voyager V2 format
'Voyager V2' seems to be how the format is referred to in the data itself...
Parameters
- item: Data as received from LinkedIn
Returns
Mapped item
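For illustration, here is a minimal sketch of running this mapping outside of 4CAT on a Zeeschuimer capture. The file name `linkedin.ndjson` is hypothetical, and each line is assumed to already contain the full Voyager V2 structure (actor, content, commentary, *socialDetail) that map_item expects:

```python
import json

from datasources.linkedin.search_linkedin import SearchLinkedIn

# Hypothetical Zeeschuimer export: one JSON-encoded Voyager V2 post per line.
with open("linkedin.ndjson") as infile:
    for line in infile:
        post = json.loads(line)
        mapped = SearchLinkedIn.map_item(post)
        if not mapped:
            # posts without an "actor" key come back as an empty dict
            continue
        print(mapped)  # a MappedItem wrapping the flat dictionary built in map_item
```

In 4CAT itself this per-item transformation is what turns each imported NDJSON line into the flat columns used by processors; the sketch only shows the transformation in isolation.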
@staticmethod
def parse_time_ago(time_ago):
Attempt to parse a timestamp for a post
LinkedIn doesn't give us the actual timestamp, only a relative indicator like "18h ago". This is annoying because it gets more imprecise the longer ago it is, and because it is language-sensitive. For example, in English 18 months is displayed as "18mo" but in Dutch it is "18 mnd".
Right now this will only adjust the 'collected at' timestamp if the data was scraped from an English or Dutch interface, and even then the timestamps will still be imprecise.
Parameters
- str time_ago: Relative timestamp, e.g. '18mo'.
Returns
Estimated timestamp of post, as unix timestamp
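To make the approximation concrete, here are a few illustrative inputs and the offsets (in seconds) they yield, following the period_lengths table in the source above; the exact strings LinkedIn produces vary by interface language:

```python
from datasources.linkedin.search_linkedin import SearchLinkedIn

# Anything after "•" is stripped before the digits and the unit letters are separated.
SearchLinkedIn.parse_time_ago("18h •")       # 18 * 3600 = 64800
SearchLinkedIn.parse_time_ago("3d •")        # 3 * 86400 = 259200
SearchLinkedIn.parse_time_ago("18mo •")      # 18 * 30.4375 * 86400 = 47336400.0
SearchLinkedIn.parse_time_ago("Promoted •")  # no digits, unknown unit -> 0
```

map_item subtracts this offset from the collection timestamp, so an unrecognised unit simply leaves the 'collected at' time unchanged.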
Inherited Members
- backend.lib.worker.BasicWorker
  - BasicWorker
  - INTERRUPT_NONE
  - INTERRUPT_RETRY
  - INTERRUPT_CANCEL
  - queue
  - log
  - manager
  - interrupted
  - modules
  - init_time
  - name
  - run
  - clean_up
  - request_interrupt
  - is_4cat_class
- backend.lib.search.Search
  - max_workers
  - prefix
  - return_cols
  - import_error_count
  - import_warning_count
  - process
  - search
  - import_from_file
  - items_to_csv
  - items_to_ndjson
  - items_to_archive
- backend.lib.processor.BasicProcessor
  - db
  - job
  - dataset
  - owner
  - source_dataset
  - source_file
  - config
  - is_running_in_preset
  - filepath
  - work
  - after_process
  - remove_files
  - abort
  - iterate_proxied_requests
  - push_proxied_request
  - flush_proxied_requests
  - iterate_archive_contents
  - unpack_archive_contents
  - extract_archived_file_by_name
  - write_csv_items_and_finish
  - write_archive_and_finish
  - create_standalone
  - save_annotations
  - map_item_method_available
  - get_mapped_item
  - is_filter
  - get_options
  - get_status
  - is_top_dataset
  - is_from_collector
  - get_extension
  - is_rankable
  - exclude_followup_processors
  - is_4cat_processor