1
00:00:00,240 --> 00:00:01,280
Welcome back.

2
00:00:01,290 --> 00:00:03,480
Let's continue with our e-mail scraper.

3
00:00:03,480 --> 00:00:06,290
So in the previous video we created the needed variables.

4
00:00:06,300 --> 00:00:07,860
We created two different sets.

5
00:00:07,890 --> 00:00:09,630
One of them stores the scraped URLs.

6
00:00:09,630 --> 00:00:11,730
And one of them stores e-mails.

7
00:00:11,790 --> 00:00:17,310
Then we told the program that we only want to scrape the first 100 URLs that it finds.

8
00:00:17,310 --> 00:00:22,590
And now we want to tell the program that all the e-mails that we find in those 100 URLs we want

9
00:00:22,590 --> 00:00:23,160
to save.

10
00:00:23,160 --> 00:00:24,660
And later on print to the screen.

11
00:00:25,640 --> 00:00:29,820
So the first thing we need to do is form a new URL. And how are we going to do that?

12
00:00:29,850 --> 00:00:39,300
Well, let's create a variable called parts. That variable will be equal to urllib.parse and then

13
00:00:39,300 --> 00:00:42,750
urlsplit of the URL that we have.

14
00:00:42,750 --> 00:00:45,840
And keep in mind this is the URL that we are currently scanning.

15
00:00:45,840 --> 00:00:47,460
And it will switch every time.

16
00:00:47,460 --> 00:00:50,970
That's why we have this count variable right here.

17
00:00:51,060 --> 00:00:52,830
We need to define the base url.

18
00:00:54,330 --> 00:00:56,300
And here we can define it like this.

19
00:00:56,350 --> 00:01:05,380
We open these curly brackets, 0.scheme, then colon, slash slash, which is the regular slash slash in

20
00:01:05,380 --> 00:01:13,530
the link, and then open curly brackets once again, 0.netloc. And just here we need to change

21
00:01:13,530 --> 00:01:16,860
this: the closing curly bracket should be after scheme.

22
00:01:16,860 --> 00:01:25,250
So right here close the curly bracket, and then at the end we want to format this with the parts that

23
00:01:25,250 --> 00:01:27,020
we created which is right here.

24
00:01:27,740 --> 00:01:29,740
So we're simply just splitting the URL.

25
00:01:29,750 --> 00:01:31,710
That is all that we are doing.
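
As a rough sketch of the step just described, assuming a made-up example URL in place of the one the scraper is currently scanning, splitting the URL and building the base URL could look like this:

```python
import urllib.parse

# Hypothetical URL standing in for the one the scraper is currently scanning.
url = "https://example.com/blog/post.html?x=1"

# urlsplit breaks the URL into named parts (scheme, netloc, path, query, fragment).
parts = urllib.parse.urlsplit(url)

# {0.scheme} and {0.netloc} read attributes of the first format() argument.
base_url = "{0.scheme}://{0.netloc}".format(parts)

print(parts.scheme, parts.netloc, parts.path)  # https example.com /blog/post.html
print(base_url)                                # https://example.com
```

The format string only keeps the scheme and the host, so the path and the query string are dropped from base_url.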

26
00:01:31,830 --> 00:01:35,690
And now let's find the path, so path will be equal to url.

27
00:01:36,590 --> 00:01:39,620
And here we are going to use slicing together with a conditional expression.

28
00:01:39,620 --> 00:01:48,810
So we are going to do something like this: from the beginning up to url.rfind, open brackets.

29
00:01:48,810 --> 00:01:57,510
We're looking for the slash and then plus one at the end and then close the square brackets.

30
00:01:57,550 --> 00:02:07,240
Then if, open up single quotes once again, add a slash in between, in parts.path.

31
00:02:07,570 --> 00:02:08,740
OK.

32
00:02:08,870 --> 00:02:13,800
In any other case, so else, we're going to simply use the url.
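
The path expression described above can be sketched as follows. The helper name directory_of and the example URLs are made up for illustration; the lecture writes the expression inline rather than in a function.

```python
import urllib.parse

def directory_of(url):
    """Return url cut off after its last slash, or url itself if its path has no slash."""
    parts = urllib.parse.urlsplit(url)
    # url.rfind("/") is the index of the last slash; +1 keeps that slash in the slice.
    return url[:url.rfind("/") + 1] if "/" in parts.path else url

print(directory_of("https://example.com/blog/post.html"))  # https://example.com/blog/
print(directory_of("https://example.com"))                 # https://example.com
```

The check is made against parts.path and not the whole URL, because the two slashes after the scheme would otherwise always count as a match.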

33
00:02:13,820 --> 00:02:17,490
Then we can print that we are processing the specified URL.

34
00:02:17,780 --> 00:02:18,110
OK.

35
00:02:18,140 --> 00:02:23,270
So %d we are going to use in order to print the number of the URL that we're currently scanning,

36
00:02:23,270 --> 00:02:30,730
so let's type here Processing %s, for the URL to be fit in there.

37
00:02:30,730 --> 00:02:40,670
So after the percent sign, instead of %d we want to pass count, and instead of %s we want to specify

38
00:02:40,820 --> 00:02:42,430
the url.

39
00:02:42,860 --> 00:02:45,440
This will simply just print to the screen which URL

40
00:02:45,490 --> 00:02:51,600
we are currently scanning and which number that URL is, from 1 to 100.
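
A small sketch of that progress message, assuming sample values for count and url; the exact wording and brackets of the message are a guess, since only %d and %s are spelled out in the lecture.

```python
# Sample values; in the scraper these come from the surrounding loop.
count = 3
url = "https://example.com/about"

# %d is replaced by the integer count, %s by the URL string, in that order.
line = "[%d] Processing %s" % (count, url)
print(line)  # [3] Processing https://example.com/about
```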

41
00:02:51,830 --> 00:02:52,510
Right after it.

42
00:02:52,520 --> 00:02:59,560
We want to try to connect to that URL, so we will simply try requests.get on the specific URL

43
00:02:59,750 --> 00:03:05,270
and we will store that inside of our response because from that response we're going to try and find

44
00:03:05,600 --> 00:03:10,580
all of the emails. In case that doesn't work we are going to catch some errors: except,

45
00:03:13,290 --> 00:03:20,380
open brackets, requests.exceptions.MissingSchema.

46
00:03:20,390 --> 00:03:22,340
That is the first error that we might encounter.

47
00:03:22,370 --> 00:03:29,480
And the second one is requests.exceptions.ConnectionError.

48
00:03:29,510 --> 00:03:33,760
That is in case we don't manage to connect.

49
00:03:34,010 --> 00:03:36,590
And in that case we simply just want to continue.
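
The try and except step can be sketched like this, assuming the requests library is installed. The helper name fetch, the timeout value, and the example URL are illustrative additions; in the lecture the code sits directly inside the loop and uses continue where this sketch returns None.

```python
import requests
import requests.exceptions

def fetch(url):
    """Return the response for url, or None if the request cannot be made."""
    try:
        # timeout is an addition for this sketch, not something shown in the lecture.
        return requests.get(url, timeout=10)
    except (requests.exceptions.MissingSchema,
            requests.exceptions.ConnectionError):
        # Inside the scraper loop this is where `continue` would skip to the next URL.
        return None

# A URL without http:// or https:// raises MissingSchema before any connection is made.
print(fetch("example.com/no-scheme"))
```

Both exception classes go in one tuple so a single except clause covers either failure.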

50
00:03:36,590 --> 00:03:41,050
We don't want to stop running our program just because we didn't manage to connect to one

51
00:03:41,060 --> 00:03:45,560
URL. We want to proceed to the next URL. And right after it,

52
00:03:45,590 --> 00:03:49,310
let's assume that we got the response and we stored it inside of response.

53
00:03:49,310 --> 00:03:55,890
Now we need to use regex in order to find all of the emails inside of that response so let's create

54
00:03:55,890 --> 00:04:03,690
a variable called new_emails to be equal to a set, so it will be a set of re.findall,

55
00:04:03,690 --> 00:04:06,990
and we already know what re.findall does.

56
00:04:06,990 --> 00:04:07,990
We used it before.

57
00:04:08,130 --> 00:04:13,500
It will simply just find all the strings with a specified pattern in between the brackets and the pattern

58
00:04:13,560 --> 00:04:14,270
will be this.

59
00:04:14,280 --> 00:04:16,850
So I will just type it out and you can simply just copy it.

60
00:04:18,370 --> 00:04:22,710
OK so this is the pattern that we are going to use in order to find all of the emails.

61
00:04:22,710 --> 00:04:29,070
Here is the at sign from the email. We are searching for anything before the at sign and anything after

62
00:04:29,070 --> 00:04:29,930
the at sign.

63
00:04:31,320 --> 00:04:34,470
And now we need to specify where we are searching this pattern in.

64
00:04:34,590 --> 00:04:39,240
Well, we want to search it in our response, and how can we get the response as text?

65
00:04:39,240 --> 00:04:50,300
Well, we can simply specify response.text, and at the end re.I, simply just ignoring the case,

66
00:04:50,470 --> 00:04:54,330
close another bracket, and this should manage to find our emails.

67
00:04:54,350 --> 00:04:59,960
Now we can update the emails, so use emails.update, and remember emails is the set that we created at the

68
00:04:59,960 --> 00:05:02,840
beginning of the program which is currently empty.

69
00:05:02,870 --> 00:05:07,250
So we are going to update it with the new emails.
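
Here is a minimal sketch of the search and update step. The pattern below is an assumption, since the lecture types it on screen without reading it out, and page_text is made-up sample text standing in for response.text.

```python
import re

# Assumed pattern: characters before the at sign, then a domain with a dot and letters.
EMAIL_PATTERN = r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+"

emails = set()  # the set created at the beginning of the program

# Made-up sample text; in the scraper this would be response.text.
page_text = "Contact: Alice@Example.com, bob@example.org, alice@example.com"

# re.I makes the match case-insensitive; wrapping in set() drops exact duplicates.
new_emails = set(re.findall(EMAIL_PATTERN, page_text, re.I))
emails.update(new_emails)

print(sorted(emails))
```

Note that re.I only affects matching, so Alice@Example.com and alice@example.com are kept as two separate entries in the set.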

70
00:05:07,350 --> 00:05:08,110
All right.

71
00:05:08,150 --> 00:05:13,400
So now we see that we have some red underlines right here, which is from the re library.

72
00:05:14,300 --> 00:05:17,870
So let's see why do we get that.

73
00:05:17,870 --> 00:05:24,670
Let's go right here and pip3 install re in case it is not installed.

74
00:05:26,190 --> 00:05:31,580
Let me just see right here maybe we need to specify double quotes like this.

75
00:05:31,780 --> 00:05:37,330
We're going to leave it like that and see what the error is once we run the program. For now I can't

76
00:05:37,330 --> 00:05:38,640
seem to find it.

77
00:05:38,830 --> 00:05:44,740
So no worries for now. We managed to create the pattern and we updated our emails set with all the

78
00:05:44,740 --> 00:05:50,050
e-mails that we found in the current link that we are scanning. What we want to do in the next video

79
00:05:50,170 --> 00:05:54,260
is proceed to the next link and then do this all over again.

80
00:05:54,430 --> 00:05:59,950
And at the end of scanning all 100 links we want to print all of the emails that we found and that we

81
00:05:59,950 --> 00:06:01,360
stored inside of this set.

82
00:06:01,710 --> 00:06:02,040
OK.

83
00:06:02,830 --> 00:06:06,610
So thank you for watching this lecture and I will see you in the next tutorial. Bye.
