datasources.linkedin.search_linkedin
Import scraped LinkedIn data
It's prohibitively difficult to scrape data from LinkedIn within 4CAT itself due to its aggressive rate limiting and login wall. Instead, import data collected elsewhere.
1""" 2Import scraped LinkedIn data 3 4It's prohibitively difficult to scrape data from LinkedIn within 4CAT itself 5due to its aggressive rate limiting and login wall. Instead, import data 6collected elsewhere. 7""" 8import datetime 9import time 10import re 11 12from backend.lib.search import Search 13from common.lib.item_mapping import MappedItem 14from common.lib.helpers import normalize_url_encoding 15 16class SearchLinkedIn(Search): 17 """ 18 Import scraped LinkedIn data 19 """ 20 type = "linkedin-search" # job ID 21 category = "Search" # category 22 title = "Import scraped LinkedIn data" # title displayed in UI 23 description = "Import LinkedIn data collected with an external tool such as Zeeschuimer." # description displayed in UI 24 extension = "ndjson" # extension of result file, used internally and in UI 25 is_from_zeeschuimer = True 26 27 # not available as a processor for existing datasets 28 accepts = [None] 29 references = [ 30 "[Zeeschuimer browser extension](https://github.com/digitalmethodsinitiative/zeeschuimer)", 31 "[Worksheet: Capturing TikTok data with Zeeschuimer and 4CAT](https://tinyurl.com/nmrw-zeeschuimer-tiktok) (also explains general usage of Zeeschuimer)" 32 ] 33 34 def get_items(self, query): 35 """ 36 Run custom search 37 38 Not available for LinkedIn 39 """ 40 raise NotImplementedError("LinkedIn datasets can only be created by importing data from elsewhere") 41 42 @staticmethod 43 def map_item(item): 44 """ 45 Parse LinkedIn post in Voyager V2 format 46 47 'Voyager V2' seems to be how the format is referred to in the data 48 itself... 49 50 :param item: Data as received from LinkedIn 51 :return dict: Mapped item 52 """ 53 54 # annoyingly, posts don't come with a timestamp 55 # approximate it by using the time of collection and the "time ago" 56 # included with the post (e.g. 
'published 18h ago') 57 if not item.get("actor"): 58 return {} 59 60 if "__import_meta" in item: 61 time_collected = int(item["__import_meta"]["timestamp_collected"] / 1000) # milliseconds 62 else: 63 # best we got 64 time_collected = int(time.time()) 65 66 time_ago = item["actor"]["subDescription"]["text"] if item["actor"].get("subDescription") else "" 67 timestamp = int(time_collected - SearchLinkedIn.parse_time_ago(time_ago)) 68 69 # images are stored in some convoluted way 70 # there are multiple URLs for various thumbnails, use the one for the 71 # largest version of the image 72 images = [] 73 if item["content"] and "images" in item["content"]: 74 for image in item["content"]["images"]: 75 image_data = image["attributes"][0]["vectorImage"] 76 artifacts = sorted(image_data["artifacts"], key=lambda x: x["width"], reverse=True) 77 url = image_data["rootUrl"] + artifacts[0]["fileIdentifyingUrlPathSegment"] 78 images.append(url) 79 80 # or alternatively they are stored here: 81 if not images and item["content"] and item["content"].get("articleComponent") and item["content"]["articleComponent"].get("largeImage"): 82 image = item["content"]["articleComponent"]["largeImage"]["attributes"][0]["detailData"]["vectorImage"] 83 if not image and item["content"]["articleComponent"]["largeImage"]["attributes"][0]["imageUrl"]: 84 images.append(item["content"]["articleComponent"]["largeImage"]["attributes"][0]["imageUrl"]["url"]) 85 elif image and image.get("artifacts"): 86 images.append(image["rootUrl"] + image["artifacts"][0]["fileIdentifyingUrlPathSegment"]) 87 88 # video thumbnails are stored similarly as image data 89 video_thumb_url = "" 90 thumb_content = None 91 if item["content"] and "*videoPlayMetadata" in item["content"]: 92 thumb_content = item["content"]["*videoPlayMetadata"]["thumbnail"] 93 elif item["content"] and "linkedInVideoComponent" in item["content"] and item["content"]["linkedInVideoComponent"]: 94 thumb_content = item["content"]["linkedInVideoComponent"]["*videoPlayMetadata"]["thumbnail"] 95 elif item["content"] and "externalVideoComponent" in item["content"] and item["content"]["externalVideoComponent"]: 96 thumb_content = item["content"]["externalVideoComponent"]["*videoPlayMetadata"]["thumbnail"] 97 if thumb_content: 98 video_thumb_url = thumb_content["rootUrl"] + thumb_content["artifacts"][0]["fileIdentifyingUrlPathSegment"] 99 100 author = SearchLinkedIn.get_author(item) 101 102 # the ID is in the format 'urn:li:activity:6960882777168695296' 103 # retain the numerical part as the item ID for 4CAT 104 # sometimes posts seem to be combined, e.g.: 105 # urn:li:aggregate:(urn:li:activity:3966023054712791616,urn:li:activity:3965915018238312449) 106 # effectively both IDs seem to refer to the same post, so just take the 107 # first one 108 meta_urn = item.get("updateMetadata", {"urn": item.get("preDashEntityUrn")})["urn"] 109 urn = "urn:li:activity:" + meta_urn.split("urn:li:activity:")[1].split(",")[0].split(")")[0] 110 item_id = urn.split(":").pop() 111 112 # the way hashtags were stored changed at some point 113 hashtags = [] 114 if item["commentary"] and "attributes" in item["commentary"]["text"]: 115 hashtags = [tag["trackingUrn"].split(":").pop() for tag in item["commentary"]["text"].get("attributes", []) if tag["type"] == "HASHTAG"] 116 elif item["commentary"] and "attributesV2" in item["commentary"]["text"]: 117 hashtags = [tag["detailData"]["*hashtag"]["trackingUrn"].split(":").pop() for tag in item["commentary"]["text"].get("attributesV2", []) if "*hashtag" in 
tag["detailData"]] 118 119 # and mentions 120 # we're storing both usernames and full names 121 author_mentions = [] 122 author_name_mentions = [] 123 if item["commentary"] and "attributes" in item["commentary"]["text"]: 124 for mention in item["commentary"]["text"].get("attributes", {}): 125 if mention["type"] == "PROFILE_MENTION": 126 mention = mention["*miniProfile"] 127 author_mentions.append(mention["publicIdentifier"]) 128 author_name_mentions.append(" ".join([mention.get("firstName", ""), mention.get("lastName", "")])) 129 elif mention["type"] == "COMPANY_NAME": 130 mention = mention["*miniCompany"] 131 author_mentions.append(mention["universalName"]) 132 author_name_mentions.append(mention.get("name", "")) 133 134 # same for metrics 135 if "*totalSocialActivityCounts" in item["*socialDetail"]: 136 metrics = { 137 "comments": item["*socialDetail"]["*totalSocialActivityCounts"]["numComments"], 138 "shares": item["*socialDetail"]["*totalSocialActivityCounts"]["numShares"], 139 "reactions": item["*socialDetail"]["*totalSocialActivityCounts"]["numLikes"], 140 "reaction_like": 0, 141 "reaction_empathy": 0, 142 "reaction_praise": 0, 143 "reaction_entertainment": 0, 144 "reaction_appreciation": 0, 145 "reaction_interest": 0 146 } 147 # There's different kind of reaction metrics 148 for reaction_type in item["*socialDetail"]["*totalSocialActivityCounts"].get("reactionTypeCounts", []): 149 metrics["reaction_" + reaction_type["reactionType"].lower()] = reaction_type["count"] 150 151 else: 152 metrics = { 153 "comments": item["*socialDetail"]["comments"]["paging"]["total"], 154 "shares": item["*socialDetail"]["totalShares"], 155 "reactions": item["*socialDetail"]["likes"]["paging"]["total"] 156 } 157 158 # and links 159 link_url = "" 160 if item.get("content") and item["content"].get("navigationContext"): 161 link_url = item["content"]["navigationContext"].get("actionTarget", "") 162 elif item.get("content") and item["content"].get("articleComponent") and "navigationContext" in item["content"]["articleComponent"]: 163 link_url = item["content"]["articleComponent"]["navigationContext"].get("actionTarget", "") 164 165 return MappedItem({ 166 "collected_from_url": normalize_url_encoding(item.get("__import_meta", {}).get("source_platform_url", "")), # Zeeschuimer metadata 167 "id": item_id, 168 "thread_id": item_id, 169 "body": item["commentary"]["text"]["text"] if item["commentary"] else "", 170 "timestamp": datetime.datetime.utcfromtimestamp(timestamp).strftime("%Y-%m-%d %H:%M:%S"), 171 "timestamp_collected": datetime.datetime.utcfromtimestamp(time_collected).strftime("%Y-%m-%d %H:%M:%S"), 172 "timestamp_ago": time_ago.split("•")[0].strip(), 173 "is_promoted": "yes" if not re.findall(r"[0-9]", time_ago) else "no", 174 **{("author_" + k).replace("_username", ""): v for k, v in author.items()}, 175 "author_mentions": ",".join(author_mentions), 176 "author_name_mentions": ",".join(author_name_mentions), 177 "hashtags": ",".join(hashtags), 178 "image_urls": ",".join(images), 179 "video_thumb_url": video_thumb_url, 180 "post_url": "https://www.linkedin.com/feed/update/" + urn, 181 "link_url": link_url, 182 **metrics, 183 "inclusion_context": item["header"]["text"]["text"] if item.get("header") else "", 184 "unix_timestamp": timestamp, 185 "unix_timestamp_collected": time_collected 186 }) 187 188 @staticmethod 189 def get_author(post): 190 """ 191 Extract author information from post 192 193 This is a bit complicated because it works differently for companies 194 and users and some fields are not 
always present. Hence, a separate 195 method. 196 197 :param dict post: Post data 198 :return dict: Author information 199 """ 200 author = { 201 "username": post["actor"]["navigationContext"]["actionTarget"].split("linkedin.com/").pop().split("?")[0], 202 "name": post["actor"]["name"]["text"], 203 "description": post["actor"].get("description", {}).get("text", "") if post["actor"].get("description") else "", 204 "pronouns": "", 205 "avatar_url": "", 206 "is_company": "no", 207 "url": post["actor"]["navigationContext"]["actionTarget"].split("?")[0], 208 } 209 210 # likewise for author avatars 211 if post["actor"]["name"].get("attributes"): 212 if "*miniProfile" in post["actor"]["name"]["attributes"][0]: 213 author_profile = post["actor"]["name"]["attributes"][0]["*miniProfile"] 214 if author_profile["picture"]: 215 avatar_artifacts = sorted(author_profile["picture"]["artifacts"], key=lambda x: x["width"], reverse=True) 216 author.update({"avatar_url": author_profile["picture"]["rootUrl"] + avatar_artifacts[0]["fileIdentifyingUrlPathSegment"]}) 217 218 if author_profile.get("customPronoun"): 219 author.update({"pronouns": author_profile.get("customPronoun")}) 220 elif author_profile.get("standardizedPronoun"): 221 author.update({"pronouns": author_profile.get("standardizedPronoun").lower()}) 222 223 elif "*miniCompany" in post["actor"]["name"]["attributes"][0]: 224 author_profile = post["actor"]["name"]["attributes"][0]["*miniCompany"] 225 avatar_artifacts = sorted(author_profile["logo"]["artifacts"], key=lambda x: x["width"], reverse=True) 226 227 author.update({"is_company": "yes"}) 228 author.update({"avatar_url": author_profile["logo"]["rootUrl"] + avatar_artifacts[0]["fileIdentifyingUrlPathSegment"]}) 229 230 if post["actor"]["name"].get("attributesV2"): 231 pronouns = post["actor"]["name"]["attributesV2"][0]["detailData"].get("*profileFullName", {}).get("pronoun") 232 if pronouns: 233 if pronouns.get("customPronoun"): 234 author.update({"pronouns": pronouns.get("customPronoun")}) 235 elif pronouns.get("standardizedPronoun"): 236 author.update({"pronouns": pronouns.get("standardizedPronoun")}) 237 238 avatar = post["actor"]["image"].get("attributes", [{}])[0].get("detailData", {}).get("nonEntityProfilePicture") 239 if avatar and avatar["vectorImage"]: 240 author.update({"avatar_url": avatar["vectorImage"]["rootUrl"] + avatar["vectorImage"]["artifacts"][0]["fileIdentifyingUrlPathSegment"]}) 241 242 return author 243 244 @staticmethod 245 def parse_time_ago(time_ago): 246 """ 247 Attempt to parse a timestamp for a post 248 249 LinkedIn doesn't give us the actual timestamp, only a relative 250 indicator like "18h ago". This is annoying because it gets more 251 imprecise the longer ago it is, and because it is language-sensitive. 252 For example, in English 18 months is displayed as "18mo" but in Dutch 253 it is "18 mnd". 254 255 Right now this will only adjust the 'collected at' timestamp if the 256 data was scraped from an English or Dutch interface, and even then the 257 timestamps will still be imprecise. 258 259 :param str time_ago: Relative timestamp, e.g. '18mo'. 
260 :return int: Estimated timestamp of post, as unix timestamp 261 """ 262 time_ago = time_ago.split("•")[0] 263 numbers = re.sub(r"[^0-9]", "", time_ago).strip() 264 letters = re.sub(r"[0-9]", "", time_ago).strip() 265 266 period_lengths = { 267 "s": 1, 268 "m": 60, 269 "h": 3600, 270 "d": 86400, 271 "w": 7 * 86400, 272 "mo": 30.4375 * 86400, # we don't know WHICH months, so use the average length of a month 273 "mnd": 30.4375 * 86400, 274 "yr": 365.25 * 86400, # likewise 275 "j": 365.25 * 86400, 276 } 277 278 numbers = int(numbers) if len(numbers) else 0 279 return period_lengths.get(letters, 0) * numbers
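Since get_items is deliberately unimplemented, the practical entry point of this module is map_item, applied to items imported from an NDJSON file. Below is a minimal sketch of that flow; the filename is hypothetical, and it assumes each line holds one Voyager V2 object as 4CAT stores it after a Zeeschuimer import, with MappedItem's get_item_data accessor exposing the flat field dictionary.

```python
import json

from datasources.linkedin.search_linkedin import SearchLinkedIn

# Hypothetical file: one Voyager V2 object per line, as stored by 4CAT
# after importing a Zeeschuimer capture.
with open("linkedin-capture.ndjson") as infile:
    for line in infile:
        mapped = SearchLinkedIn.map_item(json.loads(line))
        if mapped:  # items without an "actor" map to an empty dict
            print(mapped.get_item_data()["post_url"])
```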
```python
class SearchLinkedIn(Search):
```
Import scraped LinkedIn data
```python
def get_items(self, query):
```
Run custom search
Not available for LinkedIn
```python
@staticmethod
def map_item(item):
```
Parse LinkedIn post in Voyager V2 format
'Voyager V2' seems to be how the format is referred to in the data itself...
Parameters
- item: Data as received from LinkedIn
Returns
dict: Mapped item
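As an illustration of the mapping contract, the sketch below runs map_item on a heavily trimmed, hypothetical item. Real Voyager V2 payloads carry far more fields; the keys shown here (actor, content, commentary, *socialDetail and an activity URN) are roughly the minimum the mapping code touches, even when their values are empty.

```python
from datasources.linkedin.search_linkedin import SearchLinkedIn

# Hypothetical, heavily trimmed Voyager V2 item; real payloads are far larger.
item = {
    "actor": {
        "name": {"text": "Jane Doe"},         # display name
        "image": {},                          # no avatar data available
        "navigationContext": {"actionTarget": "https://www.linkedin.com/in/janedoe?trk=feed"},
        "subDescription": {"text": "18h •"},  # relative "time ago" label
    },
    "content": None,                          # no images, video or links
    "commentary": {"text": {"text": "Hello world"}},
    "updateMetadata": {"urn": "urn:li:activity:6960882777168695296"},
    "*socialDetail": {
        "*totalSocialActivityCounts": {"numComments": 1, "numShares": 0, "numLikes": 5},
    },
}

fields = SearchLinkedIn.map_item(item).get_item_data()
print(fields["id"])         # "6960882777168695296"
print(fields["author"])     # "in/janedoe"
print(fields["reactions"])  # 5
```

The resulting timestamp field is estimated as the collection time minus the parsed "18h" offset, which is why it should be read as approximate.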
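```python
@staticmethod
def get_author(post):
```

Extract author information from post

This is a bit complicated because it works differently for companies and users and some fields are not always present. Hence, a separate method.

Parameters
- dict post: Post data

Returns
dict: Author information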
```python
@staticmethod
def parse_time_ago(time_ago):
```
Attempt to parse a timestamp for a post
LinkedIn doesn't give us the actual timestamp, only a relative indicator like "18h ago". This is annoying because it gets more imprecise the longer ago it is, and because it is language-sensitive. For example, in English 18 months is displayed as "18mo" but in Dutch it is "18 mnd".
Right now this will only adjust the 'collected at' timestamp if the data was scraped from an English or Dutch interface, and even then the timestamps will still be imprecise.
Parameters
- str time_ago: Relative timestamp, e.g. '18mo'.
Returns
int: Estimated timestamp of post, as unix timestamp
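A few illustrative calls, with the arithmetic spelled out (return values are in seconds):

```python
from datasources.linkedin.search_linkedin import SearchLinkedIn

SearchLinkedIn.parse_time_ago("18h •")     # 18 * 3600 = 64800
SearchLinkedIn.parse_time_ago("3w")        # 3 * 7 * 86400 = 1814400
SearchLinkedIn.parse_time_ago("18mo")      # 18 * 30.4375 * 86400 = 47336400.0
SearchLinkedIn.parse_time_ago("Promoted")  # no digits, unknown unit -> 0
```

Promoted posts carry no relative timestamp at all, which is also what map_item uses to derive its is_promoted field.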
Inherited Members
- backend.lib.worker.BasicWorker
  - BasicWorker
  - INTERRUPT_NONE
  - INTERRUPT_RETRY
  - INTERRUPT_CANCEL
  - queue
  - log
  - manager
  - interrupted
  - modules
  - init_time
  - name
  - run
  - clean_up
  - request_interrupt
  - run_interruptable_process
  - get_queue_id
  - is_4cat_class
- backend.lib.search.Search
  - max_workers
  - prefix
  - return_cols
  - import_error_count
  - import_warning_count
  - process
  - search
  - import_from_file
  - items_to_csv
  - items_to_ndjson
  - items_to_archive
- backend.lib.processor.BasicProcessor
  - db
  - job
  - dataset
  - owner
  - source_dataset
  - source_file
  - config
  - is_running_in_preset
  - filepath
  - for_cleanup
  - work
  - after_process
  - clean_up_on_error
  - abort
  - iterate_proxied_requests
  - push_proxied_request
  - flush_proxied_requests
  - unpack_archive_contents
  - extract_archived_file_by_name
  - write_csv_items_and_finish
  - write_archive_and_finish
  - create_standalone
  - save_annotations
  - map_item_method_available
  - get_mapped_item
  - is_filter
  - get_options
  - get_status
  - is_top_dataset
  - is_from_collector
  - get_extension
  - is_rankable
  - exclude_followup_processors
  - is_4cat_processor