Feature Request FR-3105
Product Area Application Builder
Status CLOSED

4 Voters

Dynamic sitemap and robots.txt

ron Public
· Apr 27 2023

Idea Summary
Include the ability to dynamically generate the robots.txt and sitemap files.

Use Case
I run public-facing consumer sites. Google ranking is not a luxury – it's a requirement. Vanity URLs and friendly URLs are great, but Google still has to crawl them. I can use Insum's solution for dynamically creating a sitemap, but at present I'm stuck (I posted a forum note on it today too) because ORDS disallows crawling of everything by default. Anyone who publishes an APEX site on OCI needs access to that file in the short term, or search engines cannot serve up the app. Long term, though, it makes complete sense for both robots.txt and the sitemap to be generated.

Preferred Solution (Optional)
Include a shared component section for SEO with friendly URLs, sitemap settings with a link to be used for REST config, and a dynamic rather than static robots.txt file.

In the short term, though, we at least need to be able to edit the robots.txt file so that the friendly URL and SEO improvements actually take effect; otherwise search engines just won't index the site.

This is a great idea! You can already achieve this in APEX today with a slightly different approach.

Comments

  • ron OP 10 months ago

    I've tried so many things, from adding the built-in meta tags to a dynamic sitemap, and it still ALL comes down to the robots.txt on OCI being set to Disallow all. While dynamic generation would be amazing, having robots.txt locked against even manual edits and defaulting to disallow seems more like an oversight. Could there be a short-term solution for this, maybe? It neuters a lot of the awesome features for seemingly no reason.

  • ron OP 10 months ago

    https://kilroysworkshop.com/ords/wksp_eduslate/robots/robots.txt

    It doesn't live in the root folder, but thinking maybe a routing rule would do the trick. Mentioning here as a possible method of implementation as a REST module.

  • ron OP 10 months ago

    I wrote the following handler for a REST module - named robots.txt.

    BEGIN
       -- Set the HTTP header; the empty line ends the header section
       htp.p('Content-Type: text/plain; charset=utf-8');
       htp.p('');
    
       -- Write the robots.txt file contents, one directive per line.
       -- htp.p appends a newline after each string, so no RAW conversion
       -- or explicit CHR(10) is needed.
       htp.p('User-agent: *');
       htp.p('Allow: /');
    END;
    

    I logged a ticket and they suggested doing a load balancer redirect to a file. I did the redirect to this URL instead and that solved it. This method allows the contents to be generated dynamically, and the sitemap can be produced in similar fashion (see the Insum reference earlier). I'm not sure how to make this declarative, but that would be the preferred solution for sure.
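
As a rough illustration of the sitemap half of this approach, the sketch below builds the XML such an endpoint would return, with one `<url>` entry per page. It is written in Python only to show the output format; in the setup described above, the same logic would live in the REST handler. The function name `build_sitemap`, the host `www.example.com`, and the page paths are hypothetical.

```python
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url, page_paths):
    """Return sitemap XML as a string, with one <url> entry per page path."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path in page_paths:
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        # Join host and path with exactly one slash between them
        loc.text = base_url.rstrip("/") + "/" + path.lstrip("/")
    # ElementTree escapes characters such as & in the <loc> text
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical host and friendly-URL paths, for illustration only
    print(build_sitemap(
        "https://www.example.com",
        ["/ords/r/demo/app/home", "/ords/r/demo/app/more-info"],
    ))
```

For dynamic pages, the list of paths would come from a query over the content rows, so that every page gets its own entry.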

  • todd.bottger APEX Team OP 8 months ago

    The shared ORDS that is included with ADB currently does not allow customers to easily override or edit its default robots.txt. This is a request to the ORDS team to change that and provide a nicely packaged way to override it (as part of the Database Actions UI?).

  • ron OP 8 months ago

    Hi Todd -

    Having the ORDS team do that is one possibility on the robots side, though the sitemap is still an APEX thing even if we can update the robots.txt file itself. The robots.txt file is present and is hardcoded to Disallow, so no matter what is done for SEO, no crawler will touch the site, and Google actually posts a message that the site cannot be crawled.

    The request is either to have a way to replace or update the robots.txt (ORDS team), or to provide a declarative way within APEX to override it, like the REST handler I described earlier. We have the meta tags in the UI now, which is awesome. Adding a dynamic robots file and sitemap file (for dynamic pages) as a declarative setting would solve the problem.

    Even if ORDS makes the robots.txt editable, the sitemap problem (especially for dynamic pages, as noted in the original post) remains, because Google won't discover those pages on its own. A good example of sitemap crawling issues is here:

    https://kilroysworkshop.com/ords/r/eduslate/workshop/more-info

    Lots of content, but none of it gets crawled unless each page is an entry in the sitemap file.

    Hope that clarifies my request. Feel free to reach out with any questions. 

    Btw - couldn't have done our implementation without the tutorials you did. Much appreciated! My biggest implementation headache beyond the SEO is integration of the Stripe Payment Elements (one of my other enhancement requests). All else is working great. I even tested out a SaaS solution for creating multiple load balancers to hit different APEX apps with a master application and common database, so custom URLs now work going to a single ADB while each has completely custom UI elements. I go back to WebDB days at Oracle and am blown away by what you all have done with APEX. Awesome.

  • vladislav.uvarov APEX Team OP 8 months ago

    As far as robots.txt on OCI Autonomous Database (ADB)…

    First, a few background points to explain the current behavior (June 2023):

    1. ORDS on ADB is scoped to the /ords URI path. It does not control anything outside of this path, including /robots.txt. The only time /robots.txt would be under ORDS control is when you deploy the customer-managed ORDS (standalone), where the entire docroot is handled by the Jetty embedded in ORDS.
    2. The /robots.txt is global for the entire hostname. From the general-purpose service perspective, it wouldn't be appropriate for one APEX app, or one APEX workspace, or even for ORDS to provide their own robots.txt for the entire hostname. On ADB, ORDS is just one of several software components (tools) using the given hostname.
    3. ADB does not offer the ability to host static websites on the default “apps” hostname and so there is no option for a hostname-specific (customer-specific) robots.txt.
    4. Every customer will have different requirements for crawlers, and most customers will not want their apps crawled at all (even those apps that may be exposed to the internet). It is impossible to come up with a set of crawling directives that will be suitable for all customers and all apps, so the default safest policy is to disallow all crawlers.
    5. If a customer is interested in SEO, the vanity URL must be step number 0. From there, it is easy to implement a solution for a custom /robots.txt (I like #4 below, which is entirely declarative).

    So, the possible solutions (some of which were already mentioned in this Idea):

    1. A web server on Compute for the static website, including /robots.txt, or the customer-managed ORDS with its own docroot, fronted by the vanity URL LBaaS. This would be overkill for just the robots.txt requirement, but customers may also be interested in other benefits these options bring. No extra cost (can use Always Free resources).
    2. An ORDS RESTful module that prints out the desired static or dynamic robots.txt directives, and an LBaaS Rule Set with the 302 URL redirect rule for the /robots.txt path. Crawlers usually follow these 3xx redirects (see here). The same RESTful module can dynamically generate the Sitemap based on APEX dictionary views. No extra cost.
    3. A static robots.txt in a public Object Storage bucket, and an LBaaS Rule Set with the 302 URL redirect. Object Storage comes with a large number of free requests, but even if exceeded, the cost would typically be under $1 per month.
    4. A Web Application Firewall (WAF) rule for Request Access Control, which is applied to the /robots.txt path and performs the HTTP Response Action to return your static robots.txt content as the response body with 200 OK status code. This essentially intercepts and overrides the default /robots.txt. Attach this WAF to your vanity URL LBaaS. No extra cost (but you need to use a Paid account and be within the number of requests limit) and you get all benefits of WAF.

    With all that said, I still think it is worth exploring this Idea for a better robots.txt and Sitemap experience for specifically public APEX apps.
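
Whichever of the options above is used, the result can be checked by feeding the served robots.txt to a parser and asking whether a crawler may fetch a page. The following is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `can_crawl`, the host, and the page path are hypothetical, and the two policy strings stand in for the default disallow-all file and an allow-all override like the one shown earlier in this thread.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Stand-ins for the default policy and for a custom override
default_policy = "User-agent: *\nDisallow: /"
override_policy = "User-agent: *\nAllow: /"

# Hypothetical vanity-URL page, for illustration only
page = "https://www.example.com/ords/r/demo/app/home"

print(can_crawl(default_policy, page))   # False
print(can_crawl(override_policy, page))  # True
```

This parser applies rules in file order, which can differ from how a given search engine resolves conflicting rules, so it is only a sanity check for simple policies like these two.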

  • lhe OP 8 months ago

    @vladislav.uvarov I implemented your suggestion #3 and it worked! The key is that the location of robots.txt is actually unknown, so it needs a suffix match (with the vanity URL). This seems to be the easiest workaround. After that, it's just a matter of waiting. It took Google 5 days to re-index the page even though I explicitly submitted a re-index request. So be patient.

    Thank you very much for the suggestion!

  • martin.dsouza APEX Team OP 5 weeks ago

    Follow up on how to generate sitemap.xml using a query: https://talkapex.com/how-to-generate-a-sitemapxml-in-sql