Frequently asked questions

Getting Started

  • What does Flock do?

    Flock is a content analysis tool that makes it easier to crawl sites, collect content inventories, and analyse your content.

    All the data is collected in a collaborative space, making it easy for distributed teams to assess and manage content together.

    Flock gathers data on the quality of your content for every page, including:

    • URL structure
    • Title tags
    • Reading ease
    • Reading time
    • Word count
    • Content age
    • Missing metadata
    • Media files

    It also gives you tools to assess and manage the quality of your content:

    • The Site Report gives you a quick analysis of the overall health of your site’s content.
    • The Content Inventory allows you to filter your content by different metrics to easily find and address problem areas. For instance, you might want to filter to find pages that are missing metadata, or which content looks outdated.
  • How do I start a crawl?

    You can add a new crawl using the green “Add a site” button on the "Your sites" page. You can return to the "Your sites" page at any time by clicking the "home" icon at the top left of the screen. After clicking "Add a site", a form appears where you can enter the URL and a name for the site you're about to crawl. Note that if you enter a URL that includes a subdirectory (e.g. http://www.example.com/subdirectory), you will be given the option to either crawl the entire site from the root or crawl only within the subdirectory.

  • How long does a crawl take?

    When you add a site on Flock two things happen. First we crawl the site, then we analyse the data we collected. The first process, the crawl, goes pretty fast. It takes about 10 seconds per page.

    The second process, analysis, takes longer. There's a process that happens at the end of each crawl that takes a few minutes. Because of this, smaller sites actually take longer per-page than larger sites do.

    Average crawl times as of Sept 19, 2017:

    • A 1 page site takes about 4 minutes

    • A 30 page site takes about 15 minutes, or 30 seconds per page

    • A 1000 page site takes about 4 hours, or 15 seconds per page
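
    The per-page figures above are simply the total crawl time divided by the number of pages. A minimal sketch of that arithmetic, using only the totals quoted in this answer (the function name is hypothetical and not part of Flock):

```python
# Totals quoted in this answer (as of Sept 19, 2017): pages -> total seconds.
QUOTED_TOTALS = {
    1: 4 * 60,          # about 4 minutes
    30: 15 * 60,        # about 15 minutes
    1000: 4 * 60 * 60,  # about 4 hours
}


def seconds_per_page(pages, total_seconds):
    """Average time per page, including the fixed end-of-crawl analysis step."""
    return total_seconds / pages


for pages, total in QUOTED_TOTALS.items():
    print(f"{pages} page(s): {seconds_per_page(pages, total):.1f} seconds per page")
```

    The fixed analysis step dominates on a one-page site (240 seconds per page) and is spread thin on a 1000-page site (about 14 seconds per page), which is why larger sites look faster per page.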

  • How to crawl dynamically generated content

    Background

    In general, web content is either "static" or "dynamic". An example of static content might be a "contact us" page that shows contact information. The details might change over time, but usually you'll see the same content there. In contrast, dynamic content changes often, in some cases constantly. An example of dynamic content might be a news site which always shows the most recent articles. Search results are another example of dynamic content. The results change depending on user input and are usually different on each viewing.

    Crawling dynamically generated content can get complicated. Consider a list of search results that includes links to related search terms. When a crawler follows a link to related search results it will see a new list of results with new related terms. In this case, the act of crawling the site is in effect generating new pages to crawl. Each set of results might span many pages, each with their own list of related terms. This kind of situation can result in a crawl that includes thousands of pages of search results.

    In some cases, dynamically generated websites query external sources for content. Consider for example a site that aggregates results from Twitter related to a given keyword. In cases like this, the number of "pages" on the site is almost limitless. As long as there is more content to aggregate there will be more results to show.

     

    Problematic results

    The problem described above can have some negative results for Flock users. One such result is that you could wind up with a site that never stops crawling. To mitigate this problem, we've limited crawls to a maximum of 100k pages. This keeps crawls from continuing forever, but it still leaves you with a problem. Let's say you expect your site to have about 20k pages but, due to a large number of dynamically generated pages, your crawl is maxing out at 100k pages. In such a case, it's possible that all 100k pages are search results and that none of the pages you actually wanted to audit are included.

     

    The solution

    Flock now has a feature that lets you "blacklist" certain URLs when setting up a new crawl. Flock ignores any URLs in the blacklist when it crawls your site.

    For example, if you wanted to crawl a site but ignore all search results, you could add "http://www.example.com/search" to the blacklist.

    The powerful thing about the blacklist is that it doesn't need complete URLs. It can also process partial patterns.

    So instead of "http://www.example.com/search" you could enter "/search" and get the same results. You could also ignore all URLs containing query strings by adding "?" to your blacklist.
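
    To illustrate how partial patterns behave, here is a minimal sketch that assumes a blacklist entry matches whenever it appears anywhere in a URL. This is an assumption made for illustration; Flock's actual matching rules may differ, and the function names are hypothetical.

```python
def is_blacklisted(url, blacklist):
    """Return True if any blacklist entry appears anywhere in the URL.

    Assumption: plain substring matching, not Flock's documented behaviour.
    """
    return any(pattern in url for pattern in blacklist)


def crawlable(urls, blacklist):
    """Keep only the URLs that no blacklist entry matches."""
    return [url for url in urls if not is_blacklisted(url, blacklist)]


urls = [
    "http://www.example.com/about",
    "http://www.example.com/search",
    "http://www.example.com/search/page/2",
    "http://www.example.com/news?tag=flock",
]

# "/search" drops both search pages; "?" drops the URL with a query string.
print(crawlable(urls, ["/search", "?"]))
```

    Under this assumption, only http://www.example.com/about survives the filter. Short patterns match broadly, so an entry like "/search" would also exclude a page such as /search-tips.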

     

  • Can I pause or cancel a crawl?

    Our crawler doesn’t currently have a pause or cancel function. (If you’d find that a useful feature, let us know.)

  • Can I restart a crawl?

    After you’ve completed a crawl, you can restart or redo the crawl by going into the site’s Configure page. Once there, you can click the “Recrawl site” button to kick off a new crawl and analysis run. As always, Flock will notify you when the crawl is complete.

  • My crawl has failed. What should I do next?

    Try restarting your crawl. Errors can occur for lots of reasons, and may not happen on the second attempt.

    Email us and let us know, too! We’d like to try to figure out what’s gone wrong.

  • How do I view my results?

    You can view the results of your crawl in two different ways:

    • Site Report: This report gives you a summary of the health of your website content. It can be accessed by selecting the “Site Report” icon on the Crawled Sites landing page, or by using the navigation at the top of the screen from the site’s Inventory page.
    • Inventory Page: The inventory view gives you a full list of all the pages that have been crawled, along with key metrics for each page, such as reading grade, word count, and missing metadata.
  • How do I download my crawl data?

    There are two kinds of data you can export from Flock: inventory data (also called metadata) and content data. The inventory data includes all the metadata about each page that Flock finds when it crawls your site. This includes the URL, the page title, how many issues we found on the page (broken links, for example), and anything else we know about the page. The content data includes all the text found in the main content areas of your site. In order to avoid needless repetition, Flock automatically filters out text that appears on every page, like main navigation or sidebar elements, and collects only the text that appears to be main content. 

    There are two ways to download your crawl data from Flock: 

    1. Download from "your sites" list

    Click the download icon for the site in question. A pop-up will appear giving you the option to download either the "metadata" or "content". Keep in mind that, depending on the size of your site, the content download can result in a very large file.

    2. Download from a site's inventory

    Navigate to your site's inventory page, then click the "download CSV" button. Note that this will download the filtered data in your inventory results, so if you apply a filter first you will only download the filtered results. This is useful if you'd like to get a list of all pages with broken links, for example.

     

  • Why does my site have zero pages?

    When Flock crawls your site, it checks to see if you have a "robots.txt" file. The robots.txt file informs web robots like Flock's crawler about which areas of a website should not be processed or scanned. If you can view the site in a browser but Flock reports the site as having zero pages, it's probably being blocked by your robots.txt. To check if this is the case, append "robots.txt" to your site's root URL, like this:

    http://www.[your-site-here].com/robots.txt

    If the file exists, you will see a number of lines of text describing your site's preferences regarding bot behaviour.

    Solution:
    The best way to address this is to remove or edit your robots.txt file so that it allows Flock's crawler access. If you'd like to provide access specifically to Flock's crawler, you'll need to add a rule to your robots.txt file that creates an exception for it. To do that, you'll need to know that our crawler calls itself "Scrapy/1.0.4 (+http://scrapy.org)", but a simple "Scrapy" should suffice.
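
    As a rough illustration, the sketch below uses Python's standard urllib.robotparser module to check a candidate robots.txt before you publish it. The rules shown are an example only (they admit "Scrapy" and block every other bot), and the standard-library parser may not interpret every rule exactly as Flock's crawler does.

```python
from urllib.robotparser import RobotFileParser

# Example rules only: allow the crawler named "Scrapy", block all other bots.
EXAMPLE_ROBOTS_TXT = """\
User-agent: Scrapy
Disallow:

User-agent: *
Disallow: /
"""

FLOCK_USER_AGENT = "Scrapy/1.0.4 (+http://scrapy.org)"


def can_crawl(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(can_crawl(EXAMPLE_ROBOTS_TXT, FLOCK_USER_AGENT, "http://www.example.com/about"))
print(can_crawl(EXAMPLE_ROBOTS_TXT, "OtherBot", "http://www.example.com/about"))
```

    With these example rules the first check passes and the second fails. An empty "Disallow:" line means nothing is disallowed for that user agent.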

     

  • Can Flock scrape content from my site?

    Yes, Flock can scrape content from a website.

    You can download your content in CSV format by clicking the "download" icon on your sites list, then selecting the "content" option. For large sites it will take a while for Flock to compile the content for download, so please be patient.

    Currently this "download content" feature has some shortcomings:

    • whitespace is stripped
    • text formatting is stripped
    • unwanted HTML elements in content 
    • Excel limitations

    Whitespace is stripped
    While spaces within sentences remain intact, spaces between structural elements do not. This can cause issues where words run together without a space to separate them.

    Text formatting is stripped
    The CSV format we're using for data downloads doesn't support the use of rich text formatting. This means that we lose any bold, italic, or other styles that help separate content into chunks. We also lose other structural formatting such as headings, paragraphs, and lists. 

    Unwanted HTML elements in content
    Stripping unwanted HTML elements from your scraped content is a good thing. However, our current implementation contains some errors. For example, Flock currently leaves in <!-- HTML comments --> and some less common HTML elements such as <video>.

    Excel limitations
    This is only an issue if you view the CSV in Microsoft Excel and have page content longer than 32,767 characters, which is Excel's limit for a single cell. When a cell contains too many characters, Excel tries to place the extra characters into a new cell. New cells created this way aren't formatted properly, so the result is a messy spreadsheet with random inconsistencies in formatting. The simplest solution is to use a different spreadsheet program, such as OpenOffice, which is entirely free.
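
    If you'd like to know in advance whether a download will hit this limit, a short script can scan the CSV for long cells. The sketch below is illustrative only: the column names "url" and "content" in the sample are hypothetical, and 32,767 is Excel's documented per-cell character limit.

```python
import csv
import io

EXCEL_CELL_LIMIT = 32_767  # Excel's documented per-cell character limit


def oversized_cells(csv_file, limit=EXCEL_CELL_LIMIT):
    """Yield (record_number, column_name, length) for each cell longer than limit."""
    # The csv module rejects fields above 131,072 characters by default,
    # so raise that cap before reading long page content.
    csv.field_size_limit(2**31 - 1)
    reader = csv.DictReader(csv_file)
    for record_number, row in enumerate(reader, start=2):  # record 1 is the header
        for column, value in row.items():
            if isinstance(value, str) and len(value) > limit:
                yield record_number, column, len(value)


# Hypothetical sample standing in for a downloaded content CSV.
sample = io.StringIO(
    "url,content\n"
    "http://www.example.com/,short text\n"
    "http://www.example.com/long," + "x" * 40_000 + "\n"
)
print(list(oversized_cells(sample)))
```

    To check a real download, pass an open file instead of the sample, for example open("content.csv", newline="", encoding="utf-8"), where the file name is whatever you saved the download as.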


    What we're planning to do
    Our vision is to provide content scraped from your site in a useful format. We've started experimenting with moving from the .CSV format to .ODS, which would allow us to use rich text formatting. We also plan to improve our treatment of whitespace and HTML elements.

Google Analytics

  • How do I add analytics to my account?

    Adding Google Analytics to your Flock account allows you to access all of your content analytics in one place. Here’s how:

    Analytics detected message

    On the report page, click on the orange “access google analytics” button. You’ll then be prompted for your Google credentials in a separate window.

    Select your Google Analytics account and then allow access. Once that window closes, you’ll have access to your analytics from your main report view.

    Flock's google analytics interface

     

  • What should I do if my Flock account is not associated with my Google Analytics account?

    If you’re having difficulty associating your analytics account, please try the following steps:

    Click the configure button on the Flock navigation bar. Under site navigation you’ll see the Google Analytics panel.

    Flock's Google Analytics panel

    Ensure that the account associated with your analytics is correct by selecting the account tab. Once the credentials are correctly assigned, you should have access to your analytics. Click on the report button and your website’s all-time stats should appear underneath the Google Analytics heading.

  • Why isn't the Google Analytics data included in CSV download?

    In short, it's because Flock doesn't actually store your Google Analytics (GA) data.

    Instead, Flock polls your GA account for data on an as-needed basis. For example, when viewing a site's inventory, the Flock system retrieves your GA data for the pages shown in the inventory in real time. This is workable because the inventory is limited to show up to 50 results at a time. However, when downloading a CSV, Flock compiles data from all the pages in your site, which may number in the thousands, and there is not currently an easy way for Flock to ingest that much Google Analytics data at once. We are working on a solution to this problem. 

  • Why is my pageview count different when I view it in Flock vs when I view it within Google Analytics?

    Flock helpfully aggregates all of your page’s permutations. Google does not. What this means is that Google may see pages with similar URLs as separate and unique. For example, Google may see these URLs as different pages:

    • http://example.com
    • http://example.com/
    • http://example.com/index, and
    • http://example.com/index.html

    Flock combines the stats for these pages to provide a more accurate figure.
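
    The idea can be sketched as follows: reduce each URL to one canonical form, then sum the pageviews that share it. The normalisation rules and the pageview numbers below are assumptions for illustration; Flock's actual rules are not documented here.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def canonical_url(url):
    """Collapse common permutations of one page into a single URL.

    Assumed rules: ignore a trailing slash and a trailing "index" or "index.html".
    """
    parts = urlsplit(url)
    path = parts.path
    for suffix in ("/index.html", "/index"):
        if path.endswith(suffix):
            path = path[: -len(suffix)]
            break
    path = path.rstrip("/")
    return f"{parts.scheme}://{parts.netloc.lower()}{path}/"


def combine_pageviews(pageviews_by_url):
    """Sum pageviews for URLs that share a canonical form."""
    totals = defaultdict(int)
    for url, views in pageviews_by_url.items():
        totals[canonical_url(url)] += views
    return dict(totals)


# Hypothetical pageview counts for the four URLs listed above.
reported = {
    "http://example.com": 120,
    "http://example.com/": 300,
    "http://example.com/index": 15,
    "http://example.com/index.html": 65,
}
print(combine_pageviews(reported))
```

    With these made-up numbers, the four separate rows become a single entry for http://example.com/ with 500 pageviews.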

Troubleshooting

  • Why is my site language reported as Unknown?

    The Flock crawler finds information on language in the HTML of your pages. For example, if you have a French page, the language tag would look like this:

    <html lang="fr">...</html>

    If this tag doesn’t exist, the language value may be reported as “Unknown”.
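
    To check a page yourself, you can look for the lang attribute on its <html> element. This is a minimal sketch using Python's standard library; it is not Flock's own detection code, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class LangFinder(HTMLParser):
    """Record the lang attribute of the first <html> tag that declares one."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang") or None


def page_language(html_text):
    """Return the declared page language, or "Unknown" if none is declared."""
    finder = LangFinder()
    finder.feed(html_text)
    return finder.lang or "Unknown"


print(page_language('<html lang="fr"><body>Bonjour</body></html>'))
print(page_language("<html><body>Hello</body></html>"))
```

    The first call reports "fr" and the second reports "Unknown", which mirrors what you would see in Flock for a page without a language tag.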

    Language tags are a good thing. Here’s why:

    • Accessibility: they signal to the pronunciation engine of screen readers to switch to another language.
    • Search engines: they help search engines rank your pages for searches made in the same language.
  • What is a server and why is it unknown?

    Figuring out the server of your site is more of a geeky thing. It can give you hints about the technology that’s powering your website.

    As an agency, this can help you figure out if the technology powering a website is one in which you’ve got expertise. It also helps you optimize your approach for the technology you’re working with.

    Our crawler finds server information in the HTTP header of your homepage. If it’s not provided, this value is reported as “Unknown”.

  • Why is my content age reported as Unknown?

    Content age is retrieved from the HTTP header of your webpages. If it’s reported as unknown, it means the Flock crawler was unable to find this information.

    If a date’s not provided in your HTTP header, this might indicate an issue with your server or your Content Management System (CMS). Both of these should be adding this metadata to your content.
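
    The sketch below shows how server and content-age values could be read from a page's response headers. It assumes the relevant headers are "Server" and "Last-Modified"; this FAQ does not name the exact headers Flock reads, so treat both names, and the function name, as assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def describe_headers(headers, now=None):
    """Return the server name and content age in days, or "Unknown" for each."""
    now = now or datetime.now(timezone.utc)
    lookup = {name.lower(): value for name, value in headers.items()}

    server = lookup.get("server") or "Unknown"  # assumed header name

    age_days = "Unknown"
    raw_date = lookup.get("last-modified")  # assumed header name
    if raw_date:
        try:
            modified = parsedate_to_datetime(raw_date)
            if modified.tzinfo is None:
                modified = modified.replace(tzinfo=timezone.utc)
            age_days = (now - modified).days
        except (TypeError, ValueError):
            pass  # unparseable date: leave the age as "Unknown"

    return {"server": server, "content_age_days": age_days}


example_headers = {
    "Server": "nginx",
    "Last-Modified": "Tue, 19 Sep 2017 00:00:00 GMT",
}
print(describe_headers(example_headers, now=datetime(2017, 9, 29, tzinfo=timezone.utc)))
print(describe_headers({}))
```

    You can get the headers of your own page from your browser's developer tools, or by passing dict(response.headers) from urllib.request.urlopen to this function.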

  • Why is the audit preview blank?

    HTTP (Hypertext Transfer Protocol) is the protocol that enables browsers to connect to the World Wide Web. HTTPS is the secure version of HTTP (the "S" in HTTPS stands for "Secure"). When using HTTPS, all communications between your browser and the website you're viewing are encrypted.

    If you try to view insecure web content over a secure connection, it results in a security error. 
    Because Flock uses HTTPS, this security error prevents pages that were crawled over HTTP from loading in the preview.

    The solution:
    If the site you're crawling is available in HTTPS, try re-crawling using the HTTPS version. 

Inventory Page

  • How do I filter results?

    Just below the Inventory page title, you will see the following:

    The inventory filters interface


    Searching with the text field will filter by page title and URL. Alternatively you can select any number of filters then click the "Filter inventory" button. Click "Reset all filters" to clear all filters at once.

  • Why is the page count on the inventory different from what is represented on a Site Report graph?

    This occurs when a page's value (reading grade, reading ease, etc.) is equal to the separating value on the respective graph. The site report graphs interpret "between" as "greater than the lower value and less than or equal to the upper value" while the inventory "between" filter performs "greater than or equal to the lower value and less than or equal to the upper value". The result is a rare discrepancy between the inventory and the site report graph.
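
    The difference between the two definitions of "between" can be shown with a short example. The reading grades below are made up for illustration, and the function names are hypothetical.

```python
def report_between(value, lower, upper):
    """Site report graphs: greater than lower, less than or equal to upper."""
    return lower < value <= upper


def inventory_between(value, lower, upper):
    """Inventory filter: greater than or equal to lower, less than or equal to upper."""
    return lower <= value <= upper


# Hypothetical reading grades; two pages sit exactly on the lower boundary (8).
reading_grades = [6, 8, 8, 10, 12]
lower, upper = 8, 12

report_count = sum(report_between(g, lower, upper) for g in reading_grades)
inventory_count = sum(inventory_between(g, lower, upper) for g in reading_grades)
print(report_count, inventory_count)
```

    Here the site report counts 2 pages (grades 10 and 12), while the inventory filter counts 4, because it also includes the two pages whose grade equals the lower boundary.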

  • How to produce a content inventory in 3 easy steps

    A content inventory is a catalog of every piece of content on your website. Producing a content inventory manually can be hell, especially for large sites. Manual inventories can take days or weeks to produce. Luckily, Flock makes producing a content inventory quick and easy. Here's how to do it.

    Step 1 – Crawl your site

    Log in to Flock and click "Add a site", then enter your site's URL and click "Add this site". Your site is now queued to be crawled. Easy.

    Step 2 – Wait a bit ...

    Depending on the size of your website, the crawl could take a while. Small sites finish quickly, but large sites containing many thousands of pages can take hours or even days. On average it takes about 8 hours for Flock to crawl a 10,000-page site. As soon as Flock has crawled and analysed your site data, you will receive an email notifying you that your crawl is complete. Click the link in the email to view your results.

    Step 3 – Content inventory

    When your crawl is complete, start by checking out the "site report" page for high-level insights. To get a more detailed look at your full inventory, click "Inventory". Here you can find every page on your website, perform searches, filter and sort your data. Download your data as a CSV to bring it into your favourite spreadsheet for more detailed work.