#167 Ansible dynamic inventory caching
Opened 6 years ago by astepano. Modified 6 years ago

@cevich

Caching is more or less a requirement of the dynamic-inventory spec. Without it, there's no way to provide uniform content across multiple calls. Remember, inventory is an INPUT to ansible commands. Inputs should change during command execution.

@cevich I see; I feel there is some very important thought here.

Could you please explain more about "Inputs should change during command execution"?

Ansible inventory becomes the test environment.
We can define the test_environment before any test run.
This test_environment should stay static.


Sorry, I meant "inputs should not change during execution". Though there are exceptions, generally one never expects kill 1234 to also kill 4321 (assuming it wasn't a child).

In the case of Ansible, the only exception is if a play/role/task declares some state: Host Z must exist. That's really the only case where Host Z should be in the inventory. However, it should also not have any side-effects: e.g. a task "Host Z must exist" should never mean "Host Y does not exist".

For reference WRT why caching (persistent storage of inventory) is basically required:

From: http://docs.ansible.com/ansible/latest/dev_guide/developing_inventory.html#script-conventions

When the dynamic inventory provider is called...the script must output...all of the groups to be managed. Each group...list of each host, any child groups, and potential group variables, or simply a list of hosts

I believe what's implied by that is:

  • Inventory can be called more than once
  • All calls must represent ALL groups/hosts managed
  • Managed hosts might not actually exist
  • All else being equal, managed hosts/groups are always consistently represented.

The last point simply means: call 1 should be the same as calls 2 and 3 and 4 and... so long as nothing else changes.
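
For illustration (my own sketch, not from the ticket), this is roughly the shape a conforming script must print for --list - identically on every call, covering every managed group and host, with per-host details under _meta. The group/host names and vars below are made up:

    #!/bin/bash
    # Hypothetical conforming inventory script: --list always reports ALL
    # managed groups/hosts (even ones not created yet), the same on every call.
    if [ "$1" = "--list" ]; then
        echo '{"subjects": {"hosts": ["subject-foo", "subject-bar"], "vars": {}},'
        echo ' "_meta": {"hostvars": {"subject-foo": {"ansible_host": "127.0.0.3"}}}}'
    else
        echo '{}'   # --host <name>: hostvars already live in _meta
    fi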


Here's the rub: If inventory state is to be declared by a play itself, persistence must be protected by locking, and it can't depend on system state (e.g. existence/absence of a VM in the process tree) unless that state is also protected by the same lock. NOT doing this violates the first two and the last bullets above.

(Not to mention, determining inventory state from system state is really really really hard to do, in a general sense)

Make sense?

@cevich Hi!

I see your point. I agree with: "call 1 should be the same as call 2 and 3 and 4 and...". I would add: for the same environment + the same input arguments.

===

Let's say we have a script. This script generates dynamic inventory for Ansible. This script takes some arguments. These arguments could be env vars, or some other input. This script produces output depending on its input.

As I understand you propose:

  1. This script takes into consideration its input only once.
  2. Stores its input as a cache somewhere.
  3. Ignores any input in further calls. Always uses its cache.

Right? For me this logic is a bit intricate. It is not obvious and not expected.

I have a question:
We have a Testing system / Tests / Test Runner https://fedoraproject.org/wiki/CI/Standard_Test_Interface .

Test runner:
1. sets arguments for ansible-dynamic-inventory-scripts.
2. invokes tests/tests.yml file.

Could you please describe a situation (in terms of testing-system -> test-runner -> tests) when:

  1. ansible-dynamic-inventory-scripts could be called twice? I think Ansible calls them only once.
  2. ansible-dynamic-inventory-scripts are called a second time. They are provided with some arguments, but must ignore the input and use the cache?

Almost...

  2. Stores its input as a cache somewhere.

Script stores the result of its actions in the cache, so it knows what to do when it's called in the future... or if it's called at the same time (ansible is multi-threaded and multi-process).

  3. Ignores any input in further calls. Always uses its cache.

Noooo, that's a terrible thing :D If the input clashes with existing state, return an error. The script must do what it can to honor its input.

logic is a bit intricate

That's the side-effect of parallel-processing; it complicates everything to hell and back again :disappointed_relieved: Access to the cache (persistent, known inventory-state) must be synchronized with a lock.

OTOH, outside of Ansible there is no locking, for example if a "subject" is killed/crashes/is stopped/started/changed. Since we cannot predict what a playbook will do, and if dynamic creation during any play is also supported, it's impossible to (safely) also have a dynamic inventory which is discovery-based (for example, querying for "subjects" NOT started/changed by an inventory script).

Could you please describe a situation.

  1. In this system, the playbook (one input) suggests inventory (a secondary input). So neither is predictable - they could contain anything. However, more importantly, the dyn. inv. spec does not say: "inventory will be called only once, ever". Any future version of Ansible can change any/most of its inventory-behaviors at any time, unpredictably.

Besides that, to the best of my knowledge, recent versions of ansible will re-query all inventory:

  • At the start of every play (- hosts: blah) - there could be more than one in a playbook and/or included playbooks.
  • When any subject declares a meta: refresh_inventory task (could happen in parallel if multiple subjects).
  • Any time ansible needs to update hostvars; remember, inventory provides _meta - details/facts ansible otherwise knows nothing about, nor needs to store itself.
  • Other tasks (guessing/not sure) like group_by, add_host, or tasks that use delegate_to.
  • When any ansible command references it (for example a play could run a script, or make a call directly). Though almost always frowned upon, we cannot predict what craziness will be inside a playbook :confused: yet we are required to try and do the "right" thing - perhaps crashing gracefully.
  2. No, this is not right. Cache (in general) is simply persistent storage (like a file). It usually stores the necessary bits for an application to retrieve "what happened before". It uses this for logic in future actions, and/or to perform them faster.

Specifically, for ansible-dynamic-inventory-scripts, they only need a cache to store the necessary details to represent their previous inventory state (hosts, groups, host_vars and group_vars) in future calls. The most simple example of this could be:

A bash script that echoes {} when called with --host, and cats /path/to/file (wrapped in a flock). In this case, /path/to/file is the cache, and we can assume whatever creates/changes it also does so under a flock. In all cases, this script will never have side-effects if called concurrently from multiple processes. It will always (safely) represent inventory state prior to its call-time (because of the flock).
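
A minimal sketch of such a script (lock path is illustrative; assume everything that writes /path/to/file holds the same lock):

    #!/bin/bash
    # Cache-backed dynamic inventory. /path/to/file holds the JSON inventory;
    # readers and writers all serialize on the same lock file.
    CACHE=/path/to/file
    LOCK=/path/to/file.lock
    if [ "$1" = "--host" ]; then
        echo '{}'                   # per-host vars already live in _meta
    else
        flock "$LOCK" cat "$CACHE"  # --list: locked, consistent read
    fi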

...a more concrete example follows...

...concrete example:

  1. ansible-playbook -i /usr/share/ansible/inventory/ several_plays.yml starts - hosts: subjects
  2. standard-inventory-foo creates subject-foo.
  3. standard-inventory-bar creates subject-bar.
  4. several_plays.yml applies a_role.
  5. a_role declares a_script is to be run, delegate_to: localhost:
    #!pseudo-bash
    current=$(ansible-inventory --list)
    if inventory_contains "$current" 'subject-baz'; then
        exit 0
    else
        touch "$LOCK_ON_FILE"   # from /etc/environment
        /usr/share/ansible/inventory/standard-inventory-baz --some --needed --options
    fi
  6. The next play of several_plays.yml starts - hosts: subjects.
  7. testing things happen on foo, bar, and baz*.
  8. a_handler is notified by some_role.
  9. Play ends, a_handler calls other_script (also delegate_to: localhost):
    #!pseudo-bash
    current=$(ansible-inventory --list)
    if inventory_contains "$current" 'subject-baz'; then
        rm "$LOCK_ON_FILE"
    else
        exit 0
    fi

Assuming no unwritten shenanigans occur, and all inventory scripts do proper flock()ing and caching, in what three ways will this playbook fail randomly on the same line of code?

(Note: Yes, there is an easy fix to this problem, example is about Ansible + inventory + concurrency)
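
My guess at one such fix (an assumption on my part - the linked paste may describe another): make a_script's check-and-create atomic by holding a single flock across both steps, so parallel copies can't both decide subject-baz is missing. The lock path is illustrative, and grep stands in for the pseudo inventory_contains helper:

    #!/bin/bash
    (
        flock 9   # one lock around BOTH the check and the create
        if ! ansible-inventory --list | grep -q 'subject-baz'; then
            touch "$LOCK_ON_FILE"
            /usr/share/ansible/inventory/standard-inventory-baz --some --needed --options
        fi
    ) 9>/var/lock/inventory-baz.lock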

An answer: https://paste.fedoraproject.org/paste/1T2OifyQkOzJ0JctK~sSmA
(others are possible)

Key facts:

  • Ansible cannot be relied upon to do any locking/synchronization of its own; it must all be done from scripts or modules, before or during a run.
  • Parallel $ANYTHING is really really really hard to do w/o race-conditions / random side-effects (assuming efficiency isn't killed by simply locking literally everything)
  • Ansible runs tasks in parallel across subjects (including localhost) by default
  • Though generally undesirable, there may be more than one ansible-command running at the same time.

Hi.

A few questions.

1.
If we rewrite standard-inventory-foo and standard-inventory-bar to use some cache, how will this help solve the issue-example above?

2.
Should standard-inventory-foo and standard-inventory-bar ignore baz?

1) It shouldn't be a re-write, that would suck. The easy path is we just take something like invcache and re-use it across all the inventory scripts.

How this will

Oh, sorry, I tried to limit the number of possible answers from $TOOMANY with: "assume all inventory does proper locking & caching". The exact number depends on the answer to your second question (below). If it's non-shared cache w/o locking, both foo and bar will race to create baz, maybe creating two baz's. Ansible may apply the second play to the "wrong" baz, or to no baz subjects at all. Then all subjects could also race to remove $LOCK_ON_FILE (I think that's all the races).

2) I would say yes, they should not share the same cache, but it's not a requirement, just a simplification. It might be desirable to share cache, as protection against two different inventory scripts trying to manage a host with the same inventory-hostname (hopefully unlikely). Separate caches reduce the number of shared resources, and therefore the likelihood of races.

Assuming a shared clock, race-condition opportunities compound exponentially: the number of actors, to the power of the number of shared resources, to the power of the number of access points (IIRC) - a big number, very fast.

TBH, the easy/simple solution is to have no cache, no locks, and put all subject details into static inventory files, i.e. higher layers manage host creation/destruction. This also implies no subject creation during playbook runs (add_host is evil, and runtime-addition adds the same locking/caching complexities).

In other words, all of this is because I believe the requirements are: inventory-script subject creation, inventory-script in-play subject creation, and inventory-script in-play subject destruction. Each requirement adds complexity; if simple is desired, subtract requirements :D

@cevich Thank you.
I would propose to go this way:

no subject creation during playbook runs (add_host...). This also implies no subject creation during playbook runs

Remove add_host from the current implementation. I also want to make this a part of the policy for STR.

Also a possible approach:

put all subject details into static inventory files.

Just generate static inventory from dynamic before the test run. This will help eliminate your concerns about changes in the future and multiple calls to dynamic inventory scripts.

I also like: higher-layers manage host creation/destruction.

Let's take this direction.

I am sorry, but I still believe that caching-in-inventory-scripts will bring more confusion than benefits.

no subject creation during playbook runs

Yes, outright banning that will cut down on the number of failure-modes quite a bit.

Just generate static inventory from dynamic before the test run. This will help eliminate your concerns about changes in the future and multiple calls to dynamic inventory scripts.

This is the simplest approach: no locking or cache needed. For the scripts, the job can be made even easier if they write YAML-format inventory, simply because Ansible's ini-like format doesn't conform to most built-in ini-parsers.
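
A minimal sketch of that, assuming a pre-run step writes a static YAML inventory like the following (hostnames/addresses illustrative):

inventory/subjects.yml:

    all:
      children:
        subjects:
          hosts:
            subject-foo:
              ansible_host: 192.0.2.10
            subject-bar:
              ansible_host: 192.0.2.11

and the run itself then needs no locks or cache at all:

    ansible-playbook -i inventory/ tests/tests.yml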

higher-layers manage host creation/destruction.

Going that way brings another benefit: you may not even need to write any subject creation/destruction scripts. There are already many examples of these; for something close to home, it may be worth looking at the linchpin project. I haven't used it, only read through the docs and asked some questions. But it's been under active development for a few years, so it should be fairly stable (and it would be one less thing to maintain).

I am sorry, but I still believe that caching-in-inventory-scripts will bring more confusion than benefits.

No reason to apologize, it's better to make an informed decision than run half-blindly forward. Just because Ansible supports some feature doesn't mean it should always be used :grin: Having complex dynamic-inventory and subject management introduces a lot of complexity, and much of it isn't even documented.

Having both inventory and host-management out of the way allows more focus on this repo's namesake: roles. Those are the real bread, butter, and jelly of all playbooks. Writing good, idempotent roles (and unittests for them) is plenty enough work to do.


Side-note: There are some good examples of role-unittesting in ansible-galaxy. Though, I would caution against using role-dependencies; they're almost universally believed to be anti-POLA (violating the principle of least astonishment).

Oh, and please let me know how I can help.

@cevich Hi!

Thank you for your answers!

I like the idea of higher-layers managing host creation/destruction. I played with linchpin today. I like its ideas. But I am concerned about the following:

  1. The only way to install LinchPin is via PyPI. STR already has a lot of RPM dependencies; it is a bit scary to become dependent on PyPI as well. It installs a full stack of necessary Python modules for all supported providers (openstack, libvirt, aws, gcloud, beaker, duffy, ovirt, openshift) + a new ansible + ....

  2. I tried to run LinchPin with the dummy/libvirt providers on CentOS/Fedora, without success.
    https://gist.github.com/Andrei-Stepanov/bfabc363e051956d14eb9b73102e6487

Do you know of alternatives to LinchPin?
What do you think: is it reasonable to fork LinchPin directly into STR and keep only the required functionality?

But I am concerned about the following:

Ahh right, yes, it does have a TON of dependencies, and pypi can introduce security and stability problems.

Do you know alternatives to LinchPin ?

I'm sure we can google around and find many. Though I would expect to find (rather quickly) that most existing solutions in this space are going to bring along dependencies :frowning:.

Even if the host-creation script were to use an ansible playbook w/ the cloud modules, many of those also require dependencies. However, the bigger problem is that Ansible development is on a parallel track to the distros, so those dependencies may not always sync up with the pace of Fedora.


In either case, I have experience dealing with, securing, and stabilizing pypi packages (inside a virtualenv) for production-testing. It adds some maintenance burden any time an upgrade of any component is needed, but that should be infrequent. The giant benefits of re-using linchpin here would be:

  • A uniform input-format for subjects, irrespective of the cloud-technology used.
  • Someone else maintaining this component.
  • It already outputs usable Ansible inventory.
  • Leveraging an existing community that's responsive to issues and PRs :smile:

I can do some more investigation of this if you like.


A third option could be to repurpose the creation / destruction pieces from the existing inventory scripts. Those parts comprise the bulk of their usefulness anyway. It's trivial to re-write main() and argument processing. The inventory-processing output just needs writing to a file, and small tweaks to the data-structure.

Though, I'd suggest moving the scripts into a separate repo, or at least installing them in /usr/bin/. This route also lends itself to re-using shared modules for common functionality. This in turn makes debugging and unittesting of discrete components much easier.

I'm also happy to help here.

no subject creation during playbook runs

higher-layers manage host creation/destruction.

If subjects cannot be created during playbook runs, how will one declare the subjects to use if one needs a multi-host test (#92)? Since the whole test suite is supposed to be declared in the tests/tests*.yml playbook(s) in the dist-git repo (right?), how will the test launch additional VMs if doing so from the playbooks is forbidden? So far there has been a way, albeit a very inelegant one (as discussed here), using add_host: see the code example in https://pagure.io/standard-test-roles . I understand that the proposal is to ban it; what will replace it? Since the subjects are created according to the TEST_SUBJECTS environment variable, which is supposed to be set by the testing system, the test suite does not have a way to set the variable. Or are such requirements expected to be specified via some future FMF attributes instead?

Yes, this implies there must be some mechanism that allows the suite to specify the set of desired/expected subjects. It also makes the initial-state verifiable, by the playbook itself. Obviously documentation updates would need to follow such a change.

I cannot imagine a scenario where the complete set and type of required subjects cannot be known ahead of time. Even if it's a big test, with multiple plays, I promise that automation won't mind if a set of subjects just sits there, patiently waiting for their turn at the end :smile: But if I'm wrong, please explain.

There are many ways this could be done. Since $TEST_SUBJECTS and several other env. vars are well established, I'd propose amending the standard. Simply have the testing system do "pre-playbook" sourcing of a bash-script containing contents for these variables.

This puts the test-environment definition close by the test-content (playbook). It would also make (later) extending the standard (with more/other variables) easy as well.
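
A sketch of what that could look like - the file name and variable contents below are illustrative, not part of any standard:

    # Testing system, before invoking ansible:
    set -a                        # auto-export everything the file assigns
    . tests/test-environment.sh   # e.g. TEST_SUBJECTS="subject-foo subject-bar"
    set +a
    ansible-playbook -i /usr/share/ansible/inventory/ tests/tests.yml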

Yes, this implies there must be some mechanism that allows the suite to specify the set of desired/expected subjects. It also makes the initial-state verifiable, by the playbook itself. Obviously documentation updates would need to follow such a change.
I cannot imagine a scenario where the complete set and type of required subjects cannot be known ahead of time. Even if it's a big test, with multiple plays, I promise that automation won't mind if a set of subjects just sits there, patiently waiting for their turn at the end 😄 But if I'm wrong, please explain.

Good. I just wanted to make sure that this is taken into account somehow.

There are many ways this could be done. Since $TEST_SUBJECTS and several other env. vars are well established, I'd propose amending the standard. Simply have the testing system do "pre-playbook" sourcing of a bash-script containing contents for these variables.

Since $TEST_SUBJECTS essentially specifies what the testing system wants to be tested, shouldn't the test suite rather use a different mechanism for declaring what resources it needs to accomplish the testing? I.e., shall the additional VMs for multihost testing be considered test subjects at all?

This puts the test-environment definition close by the test-content (playbook). It would also make (later) extending the standard (with more/other variables) easy as well.

I think using the flexible metadata format is more, er, flexible and also more declarative (if one sources bash scripts with variables, people will be tempted to program in them ...)

I think using the flexible metadata format

I'm not married to env. vars. or sourcing a script; it was just a thought. Any extremely trivial key/value format is all that's needed. Maybe a CSV, INI file, or similar. I dunno the details about the 'flexible metadata format', so I cannot comment.

Though based on the discussion above, it sounds like simplicity is king. So maybe literally, just:

somefile.conf:

key1 = value
key2 = first_item
       second_item
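
For illustration, a hypothetical sketch of how a runner could parse that format, folding indented continuation lines into the previous key's value (nothing about this format is standardized):

    #!/bin/bash
    declare -A conf
    key=
    while IFS= read -r line; do
        if [[ $line =~ ^([A-Za-z0-9_]+)[[:space:]]*=[[:space:]]*(.*)$ ]]; then
            key=${BASH_REMATCH[1]}              # new "key = value" entry
            conf[$key]=${BASH_REMATCH[2]}
        elif [[ -n $key && $line =~ ^[[:space:]]+(.*)$ ]]; then
            conf[$key]+=" ${BASH_REMATCH[1]}"   # continuation line
        fi
    done < somefile.conf
    echo "key2 = ${conf[key2]}"                 # -> key2 = first_item second_item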
