About a month ago we
Netpeak Spider is becoming a real ‘machine for search engine optimization’, so we thought: why not call it an ‘SEO Terminator’?
Meet Netpeak Spider 2.1 – a program aimed at detecting on-page SEO issues.
1. Classification of issues
The new version detects more than 50 types of issues, which we prioritized as follows:
- Error → critical issues
- Warning → important, but not critical issues
- Notice → issues you should pay attention to
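The three severity levels can be modeled as an ordered enum. This is a minimal illustrative sketch in Python, not Netpeak Spider's internal code; the class and member names are assumptions:

```python
from enum import IntEnum

class Severity(IntEnum):
    """Issue severity levels, ordered from least to most severe."""
    NOTICE = 1   # issues you should pay attention to
    WARNING = 2  # important, but not critical
    ERROR = 3    # critical issues

# Highlighting a row by its highest-severity issue reduces to max():
issues = [Severity.NOTICE, Severity.ERROR, Severity.WARNING]
worst = max(issues)  # Severity.ERROR
```

An ordered enum like this also explains the highlighting rule described later: the color of a row follows the highest severity found in it.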
Now, on the right side of the program window, you can see the ‘Issues’ panel, where all issues found during crawling are presented.
The issues list will be continually extended and refined. For now, it looks like this:
| Issue | Description |
|---|---|
| Duplicate Pages* | Indicates all pages that share the same page hash value. URLs in this report are grouped by page hash |
| Duplicate Body Content* | Indicates all pages that share the same hash value of the <body> section. URLs in this report are grouped by page body hash |
| Duplicate Titles* | Indicates all pages with title tags that appear on more than one page of the crawled website. URLs in this report are grouped by title tag |
| Missing or Empty Title | Indicates all pages without a title tag or with an empty one |
| Duplicate Descriptions* | Indicates all pages with meta description tags that appear on more than one page of the crawled website. URLs in this report are grouped by meta description tag |
| Missing or Empty Description | Indicates all pages without a meta description tag or with an empty one |
| 4xx Error Pages: Client Error | Indicates all pages that return a 4xx HTTP status code |
| Redirect to 4xx Error Page | Indicates all pages that redirect to 4xx error pages, such as 404 Not Found |
| Endless Redirect | Indicates all pages that redirect to themselves, thereby creating an infinite redirect loop |
| Max Redirections | Indicates all pages that redirect more than 4 times (by default). Note that you can change the maximum number of redirects in the 'Restriction' tab of the crawling settings |
| Connection Error | Indicates all pages that failed to respond because of a connection error |
| Max URL Length | Indicates all pages with more than 2,000 characters in the URL |
| Missing Internal Links | Indicates all pages with no internal links. Note that such pages receive link juice but do not pass it on |
| Broken Images | Indicates images that return a 4xx-5xx status code. Note that the 'Images' content type should be checked in the 'General' tab of the crawling settings to enable detection of this issue |
| Multiple Titles | Indicates all pages with more than one title tag |
| Multiple Descriptions | Indicates all pages with more than one meta description tag |
| Missing or Empty h1 | Indicates all pages without an h1 header tag or with an empty one |
| Multiple h1 | Indicates all pages with more than one h1 header tag |
| Duplicate h1* | Indicates all pages with h1 header tags that appear on more than one page of the crawled website. URLs in this report are grouped by h1 header tag value |
| Duplicate Canonical URLs* | Indicates all pages with canonical URLs that appear on more than one page of the crawled website. URLs in this report are grouped by canonical URL |
| Min Content Size | Indicates all pages with fewer than 500 characters in the <body> section (excluding HTML tags) |
| 3xx Redirected Pages | Indicates all pages that return a 3xx redirection status code |
| Non-301 Redirects | Indicates all pages that return a redirection status code other than 301 (permanent redirect) |
| Redirect Chain | Indicates all pages that redirect more than once |
| Meta Refresh Redirected | Indicates all pages with a redirect in the <meta http-equiv="refresh"> tag in the <head> section |
| Blocked by Robots.txt | Indicates all pages that are disallowed in the robots.txt file |
| Blocked by Meta Robots | Indicates all pages that contain a <meta name="robots" content="noindex"> tag in the <head> section |
| Blocked by X-Robots-Tag | Indicates all pages that contain a 'noindex' directive in the X-Robots-Tag of the HTTP response header |
| Internal Nofollowed Links | Indicates all pages that contain internal links with the rel="nofollow" attribute |
| Missing Images ALT Attributes | Indicates all pages that contain images without the alt attribute. To view the report, click the 'Current Table Summary' button, choose 'Images' and set the appropriate filter (Include → URLs with issue → Missing Images ALT Attributes) |
| Max Image Size | Indicates images whose size exceeds 100 KB. Note that the 'Images' box should be checked in the 'General' tab of the crawling settings to enable detection of this issue |
| 5xx Error Pages: Server Error | Indicates all pages that return a 5xx HTTP status code |
| Long Server Response Time | Indicates all pages with a response time of more than 500 ms |
| Other Failed URLs | Indicates all pages that failed to respond as a result of other, unknown errors |
| Same Title and h1 | Indicates all pages with identical title and h1 header tags |
| Max Title Length | Indicates all pages with a title tag of more than 70 characters |
| Short Title | Indicates all pages with a title tag of fewer than 10 characters |
| Max Description Length | Indicates all pages with a meta description tag of more than 160 characters |
| Short Description | Indicates all pages with a meta description tag of fewer than 50 characters |
| Max h1 Length | Indicates all pages with an h1 header tag of more than 65 characters |
| Max HTML Size | Indicates all pages with more than 200,000 characters in the <html> section (including HTML tags) |
| Max Content Size | Indicates all pages with more than 50,000 characters in the <body> section (excluding HTML tags) |
| Min Text/HTML Ratio | Indicates all pages with a text-to-HTML ratio of less than 10 percent |
| Nofollowed by Meta Robots | Indicates all pages that contain a <meta name="robots" content="nofollow"> tag in the <head> section |
| Nofollowed by X-Robots-Tag | Indicates all pages that contain a 'nofollow' directive in the X-Robots-Tag of the HTTP response header |
| Missing or Empty Canonical Tag | Indicates all pages without a canonical URL or with an empty one |
| Different Page URL and Canonical URL | Indicates all pages whose canonical URL differs from the page URL |
| Max Internal Links | Indicates all pages with more than 100 internal links |
| Max External Links | Indicates all pages with more than 10 external links |
| External Nofollowed Links | Indicates all pages that contain external links with the rel="nofollow" attribute |
| Missing or Empty Robots.txt File | Indicates all URLs related to a missing or empty robots.txt file. Note that different subdomains can have different robots.txt files |
*Good news: all duplicate searches are carried out in real time, which means you don't have to run a separate tool for them → choose the necessary parameters, start crawling, and enjoy! :)
To better understand an issue, hover your cursor over it to view a tooltip. Note that issues not found during crawling are listed at the bottom of the new panel, in the ‘Not Detected Issues’ block. Issues whose detection is switched off are listed even lower, in the ‘Disabled Issues’ block.
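The duplicate reports marked with * group URLs by a page hash, which the parameters section below describes as a SHA-1 key of the <body> section. A minimal sketch of that idea in Python (the exact text normalization Netpeak Spider applies before hashing is an assumption here):

```python
import hashlib
from collections import defaultdict

def body_hash(body_text):
    """SHA-1 digest of a page's <body> content, used as a grouping key."""
    return hashlib.sha1(body_text.encode("utf-8")).hexdigest()

def group_duplicates(pages):
    """Group URLs whose <body> content hashes to the same value.

    pages: mapping of URL -> extracted <body> content.
    Returns only groups shared by two or more URLs (i.e. duplicates).
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        groups[body_hash(body)].append(url)
    return {h: urls for h, urls in groups.items() if len(urls) > 1}

pages = {
    "https://example.com/a": "Same content",
    "https://example.com/b": "Same content",
    "https://example.com/c": "Unique content",
}
dupes = group_duplicates(pages)  # one group: /a and /b
```

Because hashing each page is a single pass, duplicate grouping like this can indeed run incrementally while crawling, which is what makes the real-time duplicate reports possible.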
2. New parameters and the option to select them
The new version lets you choose the specific parameters to collect during crawling. This directly affects crawling speed and RAM consumption. For example, parameters such as Links, Redirects, Headers, and Images are resource-intensive (as noted in their settings) – switch them off if you don’t need them for the current crawl.
Altogether, 24 new parameters were added in Netpeak Spider 2.1:
| Parameter | Description |
|---|---|
| Issues | Number of all issues (errors, warnings, and notices) found on the target URL |
| X-Robots-Tag Instructions | Content of the X-Robots-Tag in the HTTP response header: it contains instructions for search engine robots and is similar to the Meta Robots tag in the <head> section |
| Response Time | Time (in milliseconds) taken for the website server to respond to a visitor's request; the same as Time To First Byte (TTFB) |
| Content Download Time | Time (in milliseconds) taken for the website server to return the HTML code of the page |
| Redirect Target URL | Target URL of a single redirect or a redirect chain, if one exists |
| Content-Length | Content of the 'Content-Length' field in the HTTP response headers; used to indicate the response body length in octets (8-bit bytes) |
| Content-Encoding | Content of the 'Content-Encoding' field in the HTTP response headers; used to indicate the type of data encoding |
| Parameters in <head> Tags | |
| Meta Refresh | Content of the <meta http-equiv="refresh"> tag in the <head> section of the document |
| Rel Next/Prev URL | Content of the <link rel="next" /> and <link rel="prev" /> tags, used to indicate the relationship between component URLs in a paginated series |
| h1 Value | Content of the first non-empty <h1> tag on the target URL |
| h1 Length | Number of characters in the first non-empty <h1> tag on the target URL |
| h2-h6 Headers | Number, value, and length of h2-h6 headers on the target URL: these parameters are disabled by default, but you can enable their analysis if needed |
| HTML Size | Number of characters in the <html> section of the target page, including HTML tags |
| Content Size | Number of characters (including spaces) in the <body> section of the target page, excluding HTML tags |
| Text/HTML Ratio | Percentage of text content on the target page, rounded to the nearest integer |
| Characters | Number of characters (excluding spaces) in the <body> section of the target page, excluding HTML tags |
| Words | Number of words in the <body> section of the target page |
| Characters in <p> | Number of characters (excluding spaces) in <p></p> tags in the <body> section of the target page |
| Words in <p> | Number of words in <p></p> tags in the <body> section of the target page |
| Page Body Hash | Unique key of the page's <body> section, calculated using the SHA-1 algorithm |
| Images | Number of images found in <img> tags on the target page. You can also find image alt attributes and the URL source view linking to the images |
Each issue is directly tied to the parameter in which it can be detected. For instance, to check whether the <title> tags on the website are implemented correctly, you need to select the Title parameter in the ‘Parameters’ tab of the crawling settings.
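Several of the size parameters above (HTML Size, Content Size, Text/HTML Ratio) can be reproduced with the standard library. This is an illustrative approximation, not the tool's exact counting rules:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content, ignoring the tags themselves."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def text_html_ratio(html):
    """Percentage of text content in the raw HTML, rounded to the nearest integer."""
    extractor = TextExtractor()
    extractor.feed(html)
    text = "".join(extractor.parts)       # roughly 'Content Size'
    return round(len(text) / len(html) * 100)  # len(html) ~ 'HTML Size'

sample = "<html><body><p>Hello, world</p></body></html>"
ratio = text_html_ratio(sample)  # 12 text chars out of 45 total → 27
```

A page flagged by the Min Text/HTML Ratio issue would return a value below 10 from a function like this.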
3. New logic of working with the results
So much information has to fit into this section that we had to resort to lists inside lists :) So, let’s go.
3.1. Completely new results table
We’ve integrated a completely new results table into Netpeak Spider 2.1, and we hope you’ll enjoy the features listed below:
It doesn’t actually matter how many results the new table holds – one hundred or one million. You’ll be really surprised by the table's response time, sometimes even doubting whether you really scrolled to the right place so quickly :) In short, we did our best to provide a better user experience, and we’d be more than happy to hear your feedback.
Now you can group data by any parameter in any table. This lets you find new ways of looking at the crawling results. For instance, you can group the results by the Status Code parameter and see which status code is most common for a certain type of page. Note that grouping is possible not only by one column but by several columns at once. Imagine the insights you may get by setting the right combinations.
✔ Columns on/off
If you right-click any column name, you will see a convenient panel where you can toggle the visibility of any column enabled in the ‘Parameters’ tab of the crawling settings. Be aware that export follows these settings: the export file will include all the columns whose view is switched on.
✔ Freezing columns
Now you can freeze any number of columns; the ‘Number’ and ‘URL’ columns are frozen by default. Saving the column width, order, and freezing is planned for future updates, but for now these settings are kept only within the current session (i.e. until you close the app).
3.2. New internal tables
Types of tables
✔ Issues info
We are proud to present a new additional table where you can see all the issues found on the crawled website. Here you can filter URLs by issue type, severity, and the parameters in which the issue was detected:
✔ Redirects
The updated table shows all the redirects / redirect chains found on the page(s):
✔ Links
An absolutely new table containing really useful data about the link type, anchor, alt attribute (if the image is in an <a href=""> tag), rel attribute, and even the URL source view:
✔ h1-h6 headers
Each header has its own table. If you need to analyze h2-h6 headers, don’t forget to enable their crawling in the ‘Parameters’ tab of the crawling settings.
✔ Images
A new additional table where you can find data about all images found in <img> tags on the page(s).
✔ Current Table Summary
Another thing we are proud of is a unique feature that allows you to open the necessary information (issues, links, redirects, h1 headers, or images) for all pages in the current table.
For example, filter the table by simply clicking any issue type in the ‘Issues’ panel on the right (e.g. 4xx Error Pages: Client Error, if any) and then select Current Table Summary → Incoming Links. You’ll get a complete list of broken links:
Now every internal table can be exported, just like the data in the main results tables.
Many new parameters to filter data by were added, along with summary filters such as ‘All parameters’ (all cells in the results table are filtered) and ‘URLs with issue’ (available only if the appropriate parameters are selected). ‘Length’ is one more option you can now filter by → any cell in the table can be filtered by its length.
Try combining the last two features: filter first, then press ‘Export’ → only the filtered results will be exported.
✔ Ways to choose the data
For your convenience, there are now three ways to choose the data:
- one URL → select any cell and call any internal table; you’ll get the data only for the selected URL;
- a group of URLs → select several URLs (holding the left mouse button or using the SHIFT/CTRL keys) and call one of the internal tables; the data will be grouped for the selected URLs;
- all URLs in the current table → click the ‘Current Table Summary’ button and choose any internal table; you’ll get the data for all the URLs in the table.
By combining these ways of handling the data, you can work with the crawling results most efficiently. We’d be really happy to get your feedback, since we’ve put a lot of effort into improving Netpeak Spider’s usability.
3.3. Highlighting the problems
Now, if a particular URL contains an issue, only the URL cell and the corresponding parameter cells are highlighted, not the whole row. The color depends on the highest issue severity in the row or cell. We removed the option to customize table colors to show how we prioritize issues by severity.
3.4. Better link distinction
Now all links are divided into exact types:
- AHREF → the most common link from <a href=""> tag
- IMG AHREF → so-called image links – images from the <img> tag inside the <a href=""> tag
- IMG → links to the images from <img> tag
- CSS → links to cascading style sheets
- Canonical → links from <link rel="canonical" /> tag in the <head> section
- Redirect → if Netpeak Spider detects a redirect to any URL, it marks a ‘Redirect’ type link to that page
- LINK → to enable the detection of this type of links, you need to check crawling of the URLs from <link> tag in the ‘General’ tab of the crawling settings
- Meta Refresh → for detecting such type of links you should also check to consider Meta Refresh in the ‘Advanced’ tab of the crawling settings
Besides, we added some more parameters to every internal table with links:
- Alt → useful for image links: the anchor of such a link is the image's alt attribute (if it exists) in the <a href=""> tag
- Rel → use it to detect links with rel="nofollow" and other values of this attribute (learn more)
- URL Source View → a unique feature that shows the original view of the link (just as the crawler sees it); helpful when you need to find the exact link in the page source
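The link typing above can be approximated with the standard HTML parser. The type names follow the list above, but the classification rules here are a simplified sketch, not Netpeak Spider's actual implementation:

```python
from html.parser import HTMLParser

class LinkTyper(HTMLParser):
    """Classifies outgoing links by the element they come from."""
    def __init__(self):
        super().__init__()
        self.links = []          # (type, url) pairs in document order
        self.in_anchor = False   # are we inside an <a href> element?

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(("AHREF", attrs["href"]))
            self.in_anchor = True
        elif tag == "img" and attrs.get("src"):
            # an <img> inside an <a href> is an image link
            kind = "IMG AHREF" if self.in_anchor else "IMG"
            self.links.append((kind, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            rel = attrs.get("rel", "")
            if rel == "canonical":
                self.links.append(("Canonical", attrs["href"]))
            elif rel == "stylesheet":
                self.links.append(("CSS", attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False

html = '<a href="/page"><img src="/pic.png"></a><link rel="canonical" href="/page">'
typer = LinkTyper()
typer.feed(html)
# typer.links: [("AHREF", "/page"), ("IMG AHREF", "/pic.png"), ("Canonical", "/page")]
```

Redirect and Meta Refresh links come from the HTTP response rather than the HTML body, so they would be added by the crawler outside a parser like this.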
3.5. Ways to view the data and interact with it
We have completely reorganized all the tables and added new logic: if you see underlined URLs or numbers, you can interact with them. For example, select an underlined URL and press the ‘Space’ key to open detailed information about it.
If you repeat these actions with the number of incoming links, you’ll call the internal table showing all incoming links to this page or pages.
3.6. Other improvements in the tables
Real-time work
You don’t have to stop the crawl to filter or export data – you can now work with all the tables in real time, even during crawling. For example, set a filter in the ‘Filters’ table and start crawling – the data will automatically appear in the table according to the filter: extremely convenient when you’re looking for particular information on a website.
We offer three types of sorting: descending (by default), ascending, and ‘no sorting’ when you click the same column a third time.
Splitting the tables into independent ones
We have split the tables into separate, independent ‘All’, ‘Issues’, and ‘Filters’ tables. Changes to column order or width in one table no longer sync to the other tables.
If there is not enough space to show all the information in a cell, you’ll see an ellipsis (...). Hover your cursor over the cell and you’ll immediately see a tooltip with the full data (note that no tooltip appears if all the data is already visible). This saves you from widening columns every time you can’t see all the data.
You can open the internal tables using hotkeys F1-F8. Open the context menu by right-clicking on the table, and you’ll find all the available combinations there.
4. Changes in crawling settings
4.1. New approach to handle the settings
Now the default crawling settings are shared by all projects. However, once you start crawling, the project’s settings are saved, and the next time you select another project you’ll see something like: ‘Crawling settings of the current project and the selected one are different. Apply last crawling settings to the selected URL?’
This way you can easily work both with specific settings for every project and with common settings shared across projects when they are the same for different websites.
4.2. Settings comparison and autosave
Now the settings are saved automatically every time you close the window or press the ‘OK’ button, so don’t worry – your changes in the different settings tabs will definitely be saved.
To avoid certain inconveniences with the crawling settings, we came up with settings-comparison logic → if you have the same settings in different projects, you can switch between them without any pop-up windows. You’ll see a pop-up only if the settings differ.
4.3. New settings
- Now you can disable crawling of all MIME types except HTML files and redirects. This can be helpful when you don’t need to crawl, for instance, RSS files or PDF documents
- A new setting to consider instructions from the X-Robots-Tag in the HTTP response header, if present
- The logic of processing canonical URLs is improved → if you enable Canonical Link Element instructions, Netpeak Spider takes into account the content of this field in the HTTP response header and gives it higher priority than the similar content in the <head> section of the page
- A new setting that allows parsing pages that return 4xx errors: note that the ‘Retrieve 4xx error pages content’ setting is off by default
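The canonical priority described above – HTTP response header first, then the <head> tag – can be sketched like this. The function names are illustrative and the header parsing is deliberately simplified:

```python
import re

def canonical_from_headers(headers):
    """Extract a canonical URL from an HTTP Link response header, if present.

    Example header value: '<https://example.com/page>; rel="canonical"'
    """
    link = headers.get("Link", "")
    match = re.search(r'<([^>]+)>\s*;\s*rel="canonical"', link)
    return match.group(1) if match else None

def resolve_canonical(headers, head_canonical):
    """The HTTP header canonical takes priority over the <head> tag's value."""
    return canonical_from_headers(headers) or head_canonical

headers = {"Link": '<https://example.com/page>; rel="canonical"'}
winner = resolve_canonical(headers, "https://example.com/other")
# winner -> "https://example.com/page" (header wins over the <head> value)
```

When no Link header is present, the function simply falls back to whatever <link rel="canonical"> declared in the page's <head> section.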
5. Export of the results
- Export to Excel is improved → results are now exported as quickly as possible
- Export to CSV is added → this type is a perfect fit when working with large amounts of data
- The exported file name is now generated automatically, so you can see at once which table you worked in or which selection type you used in the internal tables
- The separate dialog box with export settings is removed → this shortens the time to get to the final result (i.e. the exported file). Besides, the previous option to choose parameters for export has been moved to the crawling settings
6. New project structure, application data storage, and crawling
- Renewed crawling process: its speed now depends directly on the chosen parameters
- Modified structure for saving results → unfortunately, we were unable to migrate old saved projects to the new optimized structure, so previously saved result files are not compatible with Netpeak Spider 2.1
- Compression of saved results, decreasing file size by 4 times
- Crawling speed increased by 3 times
- An advanced system for storing ‘memory-hungry’ data on the hard disk, which reduces RAM usage and thus allows crawling large websites
7. Other changes
- Due to the changes described above and a totally new program architecture, PageRank calculation is temporarily unavailable. The upcoming Netpeak Spider 2.1.3 release will provide optimized internal PageRank calculation logic!
- In-session filter saving for internal tables: Issues info, Redirects, Links, h1-h6 headers, Images
- The Status Code parameter is improved, in particular its informativeness. All status codes are now supported, and it won’t return ‘429 429’ anymore
- When loading crawling results, the Crawled URLs and Crawling Duration values are shown in the status bar, indicating the number of URLs and the time the crawl took
- The program now loads more smoothly
The future is not set!
It is you who should influence Netpeak Spider’s development – leave your feedback, ask any questions, share your ideas, or support those of other users:
A quick recap
Now that the detailed review is over, let’s sum it up. Netpeak Spider has become smarter, more flexible, and more powerful. The benefits you get from the new Netpeak Spider 2.1 are:
- Detection of more than 50 on-page SEO issues
- 24 new parameters and the option to select/set them
- Absolutely new results table
- Improved logic of handling the data, including new internal tables
- Optimized export of the results and application architecture
- More convenient way to operate the results and more flexible crawling settings
If you haven’t had the opportunity to try Netpeak Spider yet, we are pleased to offer you a 14-day free trial with full access to all the tool’s features. If you and Netpeak Spider are old friends, don’t hesitate to give the updated version a try before the global tool testing ends on Aug 19, 2016.
I’m proud of the work we’ve carried out and would like to get your feedback and advice on how to improve the program further. Come with me if you want to do effective SEO! And I’ll be back... with the updates!
We suggest you check out the next post in our Netpeak Software series →